ED 214 952 



DOCUMENT RESUME 



TM 820 098 



AUTHOR Scriven, Michael 

TITLE Evaluation Thesaurus. Third Edition.

REPORT NO ISBN-0-918528-18-6 
PUB DATE Dec 81 

NOTE 183p.

AVAILABLE FROM Edgepress, Box 69, Pt. Reyes, CA 94956 ($7.75 single
copy; $6.20 two or more copies).



EDRS PRICE MF01 Plus Postage. PC Not Available from EDRS.
DESCRIPTORS *Definitions; *Evaluation; Vocabulary

ABSTRACT

This is a thesaurus of terms used in evaluation. It is not
restricted in scope to educational or program evaluation. It
refers to product and personnel and proposal evaluation as well
as to quality control and the grading of work samples. The text
contains practical suggestions and procedures, comments and
criticisms, as well as definitions and distinctions. The criteria
for inclusion of an entry were that at least a few participants in
workshops or classes requested it; a short account was possible;
the account was found useful; or the author thought it should be
included for the edification or amusement of professionals and/or
amateurs. Only key references are provided. The scholar may find
more references in the few that are given, as that was a criterion
for selection. This work is one of a series published by the
author in 1981-82. Some of the other works deal with evaluation in
specific areas. (DWH)

***********************************************************************
*   Reproductions supplied by EDRS are the best that can be made      *
*                    from the original document.                      *
***********************************************************************



EVALUATION 

third edition 

THESAURUS 






MICHAEL 
SCRIVEN 









EDGEPRESS, INVERNESS, CALIFORNIA






International Standard Book Number 0-918528-18-6 

First edition, 1977

Second edition, September 1980

Third edition, December 1981



Copyright © 1981, Michael Scriven



Library of Congress 
CIP Number 80-68775 



No part of this book may be reproduced by any mechanical,
photographic, or electronic process or in the form of a
phonographic recording, nor may it be stored in a retrieval
system, transmitted, or otherwise copied for public or
private use, without permission from the publisher.

Edgepress
Box 69, Pt. Reyes
California 94956

Printed in the United States of America 

Single copies $7.00 + postage etc. 75¢.

Two or more, 20% off price & postage ($6.20 each/total).

California tax for California orders only: 6%
(46¢ per single copy, 36¢ per copy on multiple orders).
Mastercard or Visa acceptable.

Free copies of fourth edition to contributors who identify
significant additions and errors.




INTRODUCTION 
TO THE THIRD EDITION



Evaluation is a new discipline though an old practice. It is
not just a science, though there is a point to talking about
scientific evaluation by contrast with unsystematic or subjective
evaluation. Disciplined evaluation occurs in scholarly book
reviews, the Socratic dialogs, social criticism and in the
opinions handed down by appellate courts. Its characteristics are
the drive for a determination of merit, worth or value; the
control of bias; the emphasis on sound logic, factual foundations
and comprehensive coverage. That it has become a substantial
subject is attested to by the existence of several encyclopedias,
two professional associations with over a thousand members each,
half a dozen journals, and scores of anthologies, texts,
monographs, etc. It is a subject in its own right, not to be
dissipated in sub-headings under education, health,
law-enforcement and so on; one might as well argue that there is
no subject of statistics, only agricultural statistics, statistics
in biology, etc. Nor will it do to classify evaluation under
"Social Sciences, Sundry" since evaluation far transcends the
social sciences. The Library of Congress has recognized this in
the past year by allocating a special classification for general
works in evaluation, paralleling the one for general works on
research. (The second edition of this book was the straw that
broke the back of many years' resistance.)

It has been common amongst some workers to speak of
"evaluation research" rather than "evaluation," to convey
the distinction between casual and scientific evaluative
investigations. The redundancy is something we should in
general now put behind us, leaving it to the context to make
the distinction clear, as we do with terms like "medicine,"
"diagnosis" and "explanation."

A more important distinction to watch for is that between
doing evaluations and developing or discussing the methods and
models in evaluation. Most evaluations of the expansionist period
since the mid-60s have been merely applications of quantitative
social science techniques to some naive model of what an
evaluation should be e.g. a test of the hypothesis that the
intended results were achieved. What makes evaluation a legitimate
autonomous discipline is the realization that such a model is
completely wrong, a realization springing out of the discussion of
models and methods initiated by the educational evaluators, not
those coming from sociology, psychology, etc. The discussion of
evaluation methodology is the same self-analytical practice that
led to making a science out of metallurgy after six thousand years
of smelting, casting, annealing, and forging skills, and to great
practical advances immediately. Research without reflection on the
methods and models is only appropriate when the latter are beyond
reproach and it will take evaluation a long time to live down the
immature orgies of the seventies.

This small work may serve as a kind of miniature
text-cum-reference-guide to the field. It developed from a 1977
pamphlet with the same title, and the dictionary definition of the
term "thesaurus" (rather than Roget's exemplar) still applies to
this much larger, more detailed, and massively rewritten work: "a
book containing a store of words or information about a particular
field or set of concepts" (Webster III); "a treasury or storehouse
of knowledge" (Oxford English Dictionary). We already have
encyclopedias in sub-fields of evaluation (educational evaluation
and program evaluation), and many texts provide brief glossaries.
But for most consumers, the texts and larger compendia contain
more than they want to know or care to purchase, for they are
indeed expensive. The glossaries, on the other hand, are too
brief. Here then is a smaller and cheaper guide than the
encyclopedias, yet one that is more comprehensive than the
glossaries since it is not restricted simply to educational
evaluation or to program evaluation. It also refers to product and
personnel and proposal evaluation, to quality control and the
grading of work samples, and to many of the other areas in which
disciplined evaluation is practiced. It contains practical
suggestions and procedures, comments and criticisms, as well as
definitions and distinctions.
Where it functions as a dictionary, it is in the tradition of
Samuel Johnson's English Dictionary rather than the mighty OED;
academic presses would not have approved his definition of oats
("A grain, which in England is generally given to horses, but in
Scotland supports the people"). Where this serves as a reference
to good practice and not just good usage, it is of course briefer
than the special texts or encyclopedias, but it may provide a good
starting-point for an instructor who wishes to focus on certain
topics in considerable detail and to provide tailored readings on
those, while ensuring that students have some source for
untangling the rest of the complex conceptual net that covers this
field. Students have even read it cover to cover as a way to
review a semester's course in two days.
Smaller than the other texts, yes; more judgmental beyond
doubt. But also possibly more open to change; we print short runs
at Edgepress so that updating doesn't have to compete with
protection of inventory. Send in your corrections or suggestions,
and receive a free copy of the next edition. The most substantial
or numerous suggestions also earn the choice of a handsome book on
evaluation from our stock of spares. (At this writing, we have
spares of both encyclopedias and twenty other weighty volumes.)

The criteria for inclusion of an entry were: (a) at least a few
participants in workshops or classes requested it; (b) a short
account was possible; (c) the account was found useful, or, in a
few cases, (d) the author thought it should be included for the
edification or amusement of professionals and/or amateurs. There
is more current slang and jargon in here than would usually be
recognized by a respectable scholarly publication, but that's
exactly what gives people the most trouble. (And besides, though
some of the slang is unlovely, some of it embodies the poetry and
imaginativeness of a new field far better than more pedestrian and
technical prose.) There's not much on the solid statistics and
measurement material, because that's very well covered elsewhere,
but there's a little, because participants in some inservice
workshops for professionals have no statistical background and
find these few definitions helpful. There's a good deal about the
federal/state contract process because that's the way much of
evaluation is funded (and because its jargon is especially
pervasive and mysterious). Some references are provided, but only
a few key ones, because too many just leave the readers' problem
of selection unanswered. The scholar will usually find more
references in the few given; that was one criterion for selection
of them. Acronyms, besides a basic few, are in a supplement, to
reduce clutter. The list of entries has benefited from comparison
with the Encyclopedia of Educational Evaluation (eds. Anderson,
Ball, Murphy et al., Jossey-Bass, 1976); but there are over 120
substantive entries here that are not in EEE. The third edition
has revisions on every page, many of them extensive, and a score
of new entries (and several dozen fewer typographical errors).
About five thousand words have been added.

The University of San Francisco, through its support of the
Evaluation Institute, deserves first place in a listing of
indebtedness. In 1971-72 the U.S. Office of Education (embodied in
John Egermeier) was kind enough to support me in developing and
giving a training program in what I then called Qualitative
Educational Evaluation at the University of California at
Berkeley, and there began the glossary from which this work grew.
Two contracts with Region IX of HEW, to assist in building staff
evaluation capability, led me from giving workshops there to
developing materials which can be more widely distributed, more
detailed, and used for later reference more often than a student's
seminar notes. My students and contacts in those courses and
workshops, as others at Berkeley, Nova, USF, AERA, Capitol and on
many distant campuses have been a constant source of improvement
(still needed) in formulating and covering this exploding and
explosive field; and my colleagues and clients too. To all of
these, many thanks, most especially to Jane Roth for her work on
the original Evaluation Thesaurus which she co-authored in 1977,
and to Howard Levine for many valuable suggestions about the first
edition. Thanks, too, to Sienna S'Zell and Nola Lewis for handling
the complexities of getting this into and out of our Mergenthaler
phototypesetters. They are not to blame for our minor efforts to
reform punctuation e.g. by (usually) omitting the commas around
"e.g." since it provides its own pause in the flow; and cutting
down on the use of single quotes, since the U.S. and British
practices are reversed.

This work is one of a series to come out in 1981-82. A long
monograph, only available in an inexpensive field edition format
(typeset but stapled and not indexed), The Logic of Evaluation, is
complete and became available in early December. The Evaluation of
Composition Instruction, with Davis and Thomas, a project
supported by the Carnegie Corporation, came out in November. An
introduction to evaluation, to be called Principles and Practice
of Evaluation, is scheduled for February '82. (Longer studies
published during this period include one on product evaluation in
New Techniques for Evaluation (ed. Nick Smith, Sage, 1981); one on
the evaluation of teaching and teachers in Handbook of Teacher
Evaluation (ed. Jay Millman, Sage, 1981); one on the evaluation of
educational technology in The Future of Education (ed. Kathryn
Cirincione-Coles, Sage, 1981); and a monograph "Overview of
Evaluation" (focussed on educational evaluation) in Proc. Nat.
Acad. of Education. Books are also projected for this period on
personnel evaluation and the evaluation of information
technology.) We have also published several issues of a newsletter
on methodological issues, Evaluation Notes. Where more detail on a
topic referenced in the thesaurus is provided in the first of
these monographs, the abbreviation LE (for Logic of Evaluation) is
used.

Reyes Ridge 
Inverness 
California

December 1981



Terms printed in bold type have their own entry; but this 
slightly distracting flag is not waved more than once in any 
entry.

ACCOUNTABILITY Responsibility for the justification of
expenditures, decisions, or one's own efforts. Thus program
managers and teachers should be, it is often said, accountable for
their costs and salaries and time, or accountable for pupils'
achievement. The term is also used to refer to a movement towards
increased justification. Accountability thus requires some kind of
cost-effectiveness evaluation; it is not enough that one be able
to explain how one spent the money ("fiscal accountability"), but
it is also expected that one be able to justify this in terms of
the achieved results. Teachers have sometimes been held wholly
accountable for their students' achievement scores, which is of
course entirely inappropriate since their contribution to these
scores is only one of several factors (support from parents, from
peers, and from the rest of the school environment outside the
classroom are the most frequently cited other influences). On the
other hand, a teacher can appropriately be held accountable for
the failure to produce the same kind of learning gains in his or
her pupils that other teachers of essentially similar pupils
achieve. A common fallacy associated with accountability is to
suppose that justice requires the formulation of precise goals and
objectives if there is to be any accountability; but in fact one
may be held accountable for what one does, within even the most
general conception of professional work, e.g. for "teaching social
studies in the twelfth grade," where one might be picking a fresh
(unprescribed) topic every day, or every hour, in the light of
one's best judgment as to what contemporary social events and the
class capabilities make appropriate. Less specificity makes the
judgment more difficult, but not impossible. Captains of vessels
are held accountable for their actions in wholly unforeseen
circumstances. It is true, however, that any testing process has
to be very carefully selected and applied if educational
accountability is to be enforced in an equitable way; this does
not mean that the test must be matched to what is taught (because
what is taught may have been wrongly chosen), but it does mean
that the test must be very carefully justified e.g. by reference
to reasonable expectations as to what should have been (or could
justifiably have been) covered, given the need and ability of the
students.

ACCREDITATION The award of credentials, in particular the
award of membership in one of the regional associations of
educational institutions or one of the professional organizations
which attempt to maintain certain quality standards for
membership. The "accreditation process" is the process whereby
these organizations determine eligibility for membership and
encourage self-improvement towards achieving or maintaining that
status. The accreditation process has two phases; in the first,
the institution undertakes a self-study and self-evaluation
exercise against its own mission statement. In the second phase
the regional accrediting commission sends in a team of people
familiar with similar institutions, to examine the self-study and
its results, and to look at a very large number of particular
features of the institution, using data to be supplied by the
institution together with a checklist (Evaluative Criteria is the
best known of these, published by the National Study of School
Evaluation), which are then pulled together in an informal
synthesis process. At the elementary level, schools are typically
not visited (although there is one of the handful of regional
accrediting commissions that is an exception to this); at the high
school level a substantial team visit is involved, and the same is
true at the college level. Accrediting of professional schools,
particularly law schools and medical schools, is also widespread
and done by the relevant professional organizations; it operates
in a similar way. Accrediting of schools of education that award
credentials, e.g. for teaching in elementary schools, is done by
the state; there is also a private organization (NCATE) which
evaluates such schools. There are grave problems with the
accreditation process as currently practiced, in particular its
tendency towards the rejection of innovations simply because they
are unfamiliar (naturally this is denied); its use of teams
unskilled in the now-accepted standards for serious program
evaluation; its disinterest in looking at learning achievements by
contrast with process indicators; the inconsistency between its
practice and the claim that it accepts the institution's own
goals; the shared-bias problem; the brevity of the visits; the
institutional veto and middle-of-the-road bias in selecting team
members; the lack of concern with costs; and so on (LE). See
Institutional Evaluation.

ACHIEVEMENT vs. APTITUDE (THE APTITUDE/ACHIEVEMENT
DISTINCTION) It's obvious enough that there's a difference
between the two; Mozart presumably had more early aptitude for the
piano than you or I, even if he'd never been shown one. But
statistical testing methodology has always had a hard time over
the distinction because statistics isn't subtle enough to cope
with the point of the distinction, just as it isn't subtle enough
to cope with the distinction between correlation and causation.
For no one has achievement who doesn't have aptitude, by
definition, so there's a one-way correlation; and it's very hard
to show that someone has an aptitude without giving them a test
that actually measures (at least embryonic) achievement.
Temerarious testing types have thus sometimes been led to deny
that there is any real distinction, whereas the fact is only that
they lack the tools to detect it. Distinctions only have to be
conceptually clear, not statistically simple; and the distinction
between a capacity (an aptitude) and a manifested performance
(achievement) is conceptually perfectly clear. Empirically, we may
never find good tests of aptitude that aren't mini-achievement
tests. (Ref. The Aptitude Achievement Distinction, ed. Green,
McGraw-Hill.)

ACTION RESEARCH A little-known sub-field in the social
sciences that can be seen as a precursor of evaluation.

ACTORS Social science (and now evaluation) jargon term for
those participating in an evaluation, typically evaluator, client
and evaluee (if a person or his/her program is being evaluated).
May also be used to refer to all active stakeholders.

ADMINISTRATOR EVALUATION A species of personnel evaluation
which illustrates many of the problems of teacher evaluation in
that there is no demonstrably superior administrative style (e.g.
with respect to democratic versus authoritarian leadership), when
the criterion of merit is effectiveness, rather than enjoyability.
The three main components of administrative evaluation should be:
(a) anonymous holistic rating of observed performance as an
administrator, with an opportunity to give reasons or examples, by
all those "significantly interactive" with the individuals in
question. Identifying this group (it never needs to be more than a
dozen) is done by a preliminary request for a list from the
administrator to be evaluated, to which is attached the comment
that the search will also be instigated from the groups at the
other end of the interaction; (b) a study of objective measures of
effectiveness, e.g. turn-around time on urgently requested
materials, output indicators, staff turnover etc.; and (c)
paper-and-pencil or interview or simulation tests of relevant
knowledge and skills, in particular of new knowledge and
understanding that has become important since the time of the last
review. This kind of evaluation can easily be tied to in-service
training, so that it is a productive and supportive experience.
The usual farce of administrator evaluation via performance or
behavioral objectives is not only a prime opportunity for the con
artist to exploit, is not only indefensible because of its lack of
input from most of the people that have most of the relevant
evaluative knowledge, it is also highly destructive of creative
management because of the lack of rewards for handling "targets of
opportunity"; indeed, there are usually de facto punishments for
trying to introduce them as new objectives. (It also has the other
weaknesses of any goal-based evaluation.) It's acceptable as a
fourth component with the same weight as the three above, if
carefully managed. Administrators are often nervous about the kind
of approach listed as preferable here, because they rightly
understand that most of the people with whom they interact have a
pretty poor grasp of the administrator's extensive
responsibilities and burdens. The questionnaire must of course
rather carefully delimit the requested response to rating
(holistically) the observed behaviors, and the rest of the
objection is taken care of by the comprehensive nature of the
group responding (peers, superiors, and subordinates),
supplemented by the objective measures.

ADVOCATE-ADVERSARY EVALUATION (THE ADVERSARY APPROACH) A
type of evaluation in which, during the process and/or in the
final report, presentations are made by individuals or teams whose
goal is to provide the strongest possible case for or against a
particular view or evaluation of the program (etc.). There may or
may not be an attempt at providing a synthesis, perhaps by means
of a judge or a jury or both. The techniques were developed very
extensively in the early seventies, from the initial example in
which Stake and Denny were the advocate and the adversary (the
TCITY evaluation), through Bob Wolf, Murray Levine, Tom Owens and
others. There are still great difficulties in answering the
question, "When does this give a better picture and when does it
tend to falsify the picture of a program?" The search for justice
(where we rely on the adversary approach) is not the same as the
search for truth; nevertheless, there are great advantages about
stating and attempting to legitimate radically different
appraisals e.g. the competitive element. One of the most
interesting reactive phenomena in evaluation was the effect of the
original advocate-adversary evaluation; many members of the
"audience" were extremely upset by the fact that the highly
critical adversary report had been printed as part of the
evaluation. They were unable to temper this reaction by
recognition of the equal legitimacy accorded to the advocate
position. The significance of this phenomenon is partly that it
reveals the enormous pressures towards bland evaluation, whether
they are explicit or below the surface. In "purely logical" terms,
one might think there wasn't much difference between giving two
contradictory viewpoints equal status, and giving a merely neutral
presentation. But the effect on the audience shows that this is
not the case; and indeed, a more practically oriented logic
suggests that important information is conveyed by the former
method of presentation that is absent from the latter, namely the
range of (reasonably) defensible interpretations. See also
Relativism, Judicial Model.

ADVOCATE TEAMS APPROACH (Stufflebeam) Not to be confused
with the advocate/adversary approach to evaluation. A procedure
for developing in detail the leading options for a decision maker,
as a preliminary to an evaluation of them. Part of the input phase
in the CIPP model of evaluation. See also Critical Competitors.

AESTHETIC EVALUATION Often thought of e.g. by social
scientists as essentially the articulation of prejudice, it can
involve a substantial objective component. See Architectural
Evaluation, Literary Criticism.

AFFECTED POPULATION A program, product etc. impacts the
true consumers and its own staff. In program evaluation both
effects must be considered, though they have quite different
ethical standings. At one stage, it looked as if the Headstart
program could be justified (only) because of its benefits to those
it employed.

AFFECTIVE (Bloom) Original sense: pertaining to the domain
of affect. Often taken to be the same as the domain of feelings or
attitudes. Since these are sometimes confused with beliefs, it
should be remembered that affect should also be distinguished from
the cognitive and psychomotor domains. For example, self-esteem
and locus of control are often said to be affective variables, but
many items or interview questions which are said to measure these
actually call for estimates of self-worth and appraisals or
judgments of locus of control, which are straight propositional
claims and hence cognitive. Errors such as this often spring from
the idea that the realm of valuing is not propositional, but
merely attitudinal, a typical fallacy of the value-free ideology
in social science. While some personal values are evidently
attitudes and hence may be considered affect, some valuations
(whether or not they cause certain attitudes) are scientifically
testable assertions. Note the difference between "I feel perfectly
capable of managing my own life, selecting an appropriate career
and mate, etc." and "I am perfectly capable, etc." (Or "I feel
this program is really valuable for me" vs. "This program is
really valuable for me.") Claims about feelings are
autobiographical and the error sources are lying and lack of
self-knowledge. Claims about merit are external world claims and
verified or falsified by evaluations. The use of affective
measures, beyond the simplest expressions of pleasure, is
currently extremely dubious because of (a) these conceptual
confusions between affect and cognition, (b) deliberate
falsification of responses, (c) unconscious misrepresentation, (d)
dubious assumptions made by the interpreter, e.g. that increases
in self-esteem are desirable (obviously false beyond a certain
(unknown) point), (e) invasion of privacy, (f) lack of even basic
validation, (g) high lability of much affect, (h) high stability
of other affect. Not long ago, I heard an expert say that the only
known-valid measure of affect relates to locus of control and that
is fixed by the age of two. He may have been optimistic.

AFFIRMATIVE ACTION Often incorrectly seen as an
administrative imposition upon "proper" or "scientific"
evaluation, or as an ethical requirement (equally separate from
the real process of evaluation). This "add-on" perception is one
reason why women/minorities are still de facto discriminated
against even by those with the best intentions. The gross excesses
of many affirmative action programs should not be allowed to
obscure the underlying scientific (as well as ethical) rationale
for special procedures that equalize treatment of candidates from
groups against which there has been sustained discrimination in
the past. Avoiding disbarment of candidates for reasons of remote
nepotism and avoiding irrelevant job requirements are two examples
of about ten that are required in order that the best candidate be
selected. (Opponents of affirmative action think it necessarily
represents a reduction of commitment to the principle of selection
on merit.) See also Personnel Evaluation, and Evaluation Notes,
Nov. 1981.

ANALYTICAL (evaluation) By contrast with holistic
evaluation, which might be called macro-evaluation (by analogy
with macro-economics), analytical evaluation is micro-evaluation.
There are two main varieties: component evaluation and dimensional
evaluation. It is often thought that causal analysis or remedial
suggestions are part of analytic (typically formative) evaluation,
but they are not in fact part of evaluation at all, strictly
speaking (LE).

ANCHORING (ANCHOR POINTS) Rating scales that use numbers
(e.g. 1-6, 1-10) or letters (A-F) should normally provide some
translation of the labeled points on the scale, or at least the
end-points and mid-point. It is common, in providing these
anchors, to confuse grading language with ranking language e.g. by
defining A-F as "Excellent . . . Average . . . Poor" which has two
absolute and one relative descriptors, hence is useless if most of
the evaluands are or may be excellent (or poor). Some, probably
most, anchors for letter grades create an asymmetrical
distribution of merit e.g. because the range of performances which
D (potentially) describes is narrower than the B range; this
invalidates (though possibly not seriously) the numerical
conversion of letter-grades to grade points (LE). It may be a
virtue, if conversion is not essential. In another but related
sense of anchoring, it means cross-calibration of e.g. several
reading tests, so as to identify (more or less) equivalent scores.
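
The point about numerical conversion can be made concrete with a
small sketch. The percentage bands below are hypothetical, chosen
only so that the D range is narrower than the B range; nothing in
the entry prescribes these numbers, and the 4-3-2-1-0 grade points
are simply the conventional equal-interval conversion.

```python
# Hypothetical percentage bands for each letter grade (lower bound
# inclusive, upper bound exclusive). Assumed for illustration only.
BANDS = {
    "A": (90, 101),
    "B": (75, 90),   # a wide band
    "C": (65, 75),
    "D": (60, 65),   # a narrow band
    "F": (0, 60),
}

# The conventional conversion treats the letters as equally spaced.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def letter_for(score):
    """Return the letter grade whose band contains the score."""
    for letter, (low, high) in BANDS.items():
        if low <= score < high:
            return letter
    raise ValueError(f"score out of range: {score}")


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def grade_point_average(scores):
    """Convert each score to a letter, then average the grade points."""
    return mean([GRADE_POINTS[letter_for(s)] for s in scores])


# Two students with the same mean score (75) on the underlying scale.
student_x = [89, 61]   # graded B and D
student_y = [75, 75]   # graded B and B

print(mean(student_x), grade_point_average(student_x))
print(mean(student_y), grade_point_average(student_y))
```

With these assumed bands the two students have identical mean
scores but grade point averages a full point apart. With bands of
equal width (say ten points each) the same two records would come
out level, which is one way of seeing why unequal ranges undermine
the conversion.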

ANONYMITY The preservation of the anonymity of respondents
sometimes requires very great ingenuity. Although even bulletproof
systems do not achieve honest responses from everyone in personnel
evaluation, because of secret contract bias, leaky systems get
honesty from almost no one. The new legal requirements for open
files have further endangered this crucial source of evaluation
input; but not without adequate ethical basis. The use of a
"filter" (a person who removes identifying information, usually
the person in charge of the evaluation) is usually essential; a
suggestion box, a phone with a recorder on it to which respondents
can talk (disguising their voice), checklists that avoid the
necessity for (recognizable) handwriting, forms that can be
photocopied to avoid watermark identifiers, money instead of
stamps or reply-paid envelopes (which can be invisibly coded) are
all possibilities. Typical further problems: What if you want to
provide an incentive for responding: how can you tell who to
reward? What if, like a vasectomy, you wish to be able to reverse
the anonymizing process (e.g. to get help to a respondent in great
distress)? There are complex answers, and the questions illustrate
the extent to which this issue in evaluation design takes us
beyond standard survey techniques.

APPLES & ORANGES ("Comparing apples & oranges") Certain
evaluation problems evoke the complaint, particularly from
individuals trained in the traditional social sciences, that
any solution would be "like comparing apples and oranges."
Careful study shows that any true evaluation problem (as
opposed to a unidimensional measurement problem) involves the
comparison of unlike quantities, with the intent of achieving
a synthesis. It is the nature of the beast. On the other hand,
far from being impossible, the simile itself suggests the
solution; we do of course compare apples and oranges in the
market, selecting the one or the other on the basis of various
considerations,
such as cost, quality relative to the appropriate standards
for each fruit, nutritional value, and the preferences of
those for whom we are purchasing. Indeed, we commonly consider
two or more of these factors and rationally amalgamate the
results into an appropriate purchase. While there are
occasions on which the considerations just mentioned do not
point to a single winner, and the choice may be made
arbitrarily, this is typically not the case. Complaining about
the apples and oranges difficulty is a pretty good sign that
the complainer has not thought very hard about the nature of
evaluation (LE).

APPORTIONMENT (ALLOCATION, DISTRIBUTION) The process or
result of dividing a given quantity of resources between a
set of competing demands e.g. dividing a budget between
programs. This is in fact the defining problem of the science
of economics, but one that is usually not addressed directly
or not in practical terms within the economic literature,
presumably because any solution requires making assumptions
about the so-called "interpersonal comparison of utility"
i.e. the relative worth of providing goods to different
individuals. Thus the value-free conception of the social
sciences makes it taboo to provide practical solutions to the
apportionment problem. (An exception is the Zero-Based
Budgeting approach -- one can hardly call it a literature, and
it rarely gets referenced in an economics text.)
Apportionment is a separate evaluation predicate, distinct
from grading and ranking and scoring although all of those
are involved in it; it is -- like them -- one very practical way
of showing one's estimate of relative worth, and of all the
evaluation predicates it is probably the closest to the
decision makers' modal evaluation process. Various patently
inappropriate solutions are quite frequently used, e.g. the
"across-the-board cut." This not only rewards the padding of
budgets, and hence automatically leads to increased padding
the following year, but it also results in some funding at
below the "critical mass" level, a complete waste of money.
Another inappropriate solution involves asking program
managers to make certain levels of cut; this of course
results in the blackmail strategy of setting the critical
mass levels too high, in order to get more than is absolutely
necessary. The only appropriate kind of solution
involves some evaluation by a person external to the program,
typically in conjunction with the program manager; and the
first task of such a review must be to eliminate anything
that looks like fat in the budget. Later steps in the process
involve segmentation of each program, identification of
alternative articulations of the segments, grading of the
cost-effectiveness of the progressively larger systems in
each sequence of add-ons, and consideration of interactions
between program components that may reduce the cost of each
at certain points. Given an estimate of the "return value" of
the money (the good it would do if not used for this set of
programs), and the ethical (or democratic) commitment to
prima facie equality of interpersonal worth, one then has an
effective algorithm for spending the available budget in the
most effective way. It will typically be the case that some
funding of each of the programs will occur (unless the
minimum critical mass is too large), because of the declining
marginal utility of the services to each of the
(semi-overlapping) impacted populations, the long-term
advisability of retaining capability in each area, and the
political considerations involved in reaching larger numbers.
The process just described, although independently developed,
is similar to the procedure for zero-based budgeting, an
innovation of which the Carter Administration made a good
deal in the first years of his presidency; but serious
discussion of the methodology for it never seemed to emerge,
and the practice was naturally well behind that. (See
Evaluation News, Dec. 1978). At the informal but highly
practical level, apportionment reminds us of one of the most
brilliant examples of bias control methodology in all
evaluation: the solution to the problem of dividing an
irregularly shaped portion of food or land into two fair
shares -- You divide, and I'll choose. This is a micro-version
of the "veil of ignorance" or antecedent probability approach
to the justification of justice and ethics in Rawls, A Theory
of Justice (Harvard, 1971), and Scriven, Primary Philosophy
(McGraw-Hill, 1966). It is not surprising that ethics and
evaluation share a common border here since justice is often
analyzed as a distributional concept. Apportionment may be
logically reducible to a very complex combination of grading
and ranking, on multiple scales; but the reverse is also as
likely. In any case, it may be better to use
one possibly redundant predicate in setting up the logical
foundations of evaluation, as we do in mathematics or
symbolic logic.
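
The budget procedure outlined in this entry can be sketched in
Python. This is an illustration under stated assumptions, not
the author's own algorithm. Each program is assumed to be
already segmented into an ordered sequence of add-ons (the
first being its critical-mass level), each with an estimated
incremental cost and benefit on a common scale; `return_rate`
stands in for the "return value" of the money, expressed as
benefit per dollar. The names `Segment`, `apportion` and
`return_rate` are hypothetical. The greedy rule used here
(always fund the most cost-effective next add-on) is a simple
heuristic, is not guaranteed to be optimal, and ignores
interactions between program components.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    cost: float     # incremental cost of this add-on (must be > 0)
    benefit: float  # estimated incremental benefit, common scale


def apportion(budget, programs, return_rate):
    """Greedy sketch: repeatedly fund the next add-on with the best
    benefit/cost ratio, as long as it is affordable and beats the
    return value of simply not spending the money.

    programs: dict mapping program name -> list of Segment, in
    add-on order (index 0 is the critical-mass level).
    Returns (amount funded per program, money left over).
    """
    next_idx = {name: 0 for name in programs}
    funded = {name: 0.0 for name in programs}
    remaining = budget
    while True:
        best_name, best_ratio = None, return_rate
        for name, segments in programs.items():
            i = next_idx[name]
            if i >= len(segments):
                continue  # this program is fully funded
            seg = segments[i]
            if seg.cost > remaining:
                continue  # cannot afford the next add-on
            ratio = seg.benefit / seg.cost
            if ratio > best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            break  # nothing affordable beats the return value
        seg = programs[best_name][next_idx[best_name]]
        funded[best_name] += seg.cost
        remaining -= seg.cost
        next_idx[best_name] += 1
    return funded, remaining


# Invented figures, for illustration only.
example_programs = {
    "reading": [Segment(40, 80), Segment(20, 30), Segment(20, 10)],
    "health": [Segment(50, 70), Segment(30, 33)],
}
funded, left_over = apportion(120.0, example_programs, return_rate=1.0)
# funded == {"reading": 60.0, "health": 50.0}; left_over == 10.0
```

Both programs receive some funding, and the third reading
add-on is declined because its benefit per dollar falls below
the assumed return value; this mirrors the declining marginal
utility argument in the entry.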

ARCHITECTURAL EVALUATION Like the evaluation of detective
stories and many novels (see literary criticism), this field
involves a framework of logic and a skin of aesthetics; it is
frequently treated as if only one of these components is
important. The solution to the problems of traffic flow and
energy conservation, the use of durable fixtures that are not
overpriced, the provision of adequate floor-space and
storage, meeting the requirements of expansion, budget,
safety and the law; these are the logical constraints. The
aesthetic constraints are no less important and no easier to
achieve. Unfortunately architecture has a poor record of
learning by experience, i.e., poor evaluation commitment;
every new school building incorporates errors of the simplest
kind (e.g. classroom entries at the front of the room) and
colleges of architecture when designed by their faculty not
only make these errors but are often, and are widely thought
to be, the ugliest buildings on the campus. (Cf. evaluators
who write reports readable only by evaluators.) It is
significant that the Ford Foundation's brilliant conception
of a center for school architecture has, after several years'
operation, sunk without a trace.

ARCHIVES Repository of records in which e.g. minutes of key
meetings, old budgets, prior evaluations and other found data
are located.

ARGUMENTATION House has argued that evaluation is a form of
argumentation (Evaluating With Validity, Sage, 1980) and
hence that insights about it may be implicit in studies of
reasoning such as the "New Rhetoric." One might add the
informal logic movement literature.

ARTEFACT (or ARTIFACT) (of an experiment, evaluation,
analytical or statistical procedure) An artificial result,
one merely due to (created by) the investigatory or analytic
procedures used in an experiment, an evaluation, or a
statistical analysis, and not a real property of the
phenomenon investigated. (For an example, see Ceiling
Effect.) Typically uncovered -- and in good designs guarded
against -- by using multiple independent methods of
investigation/analysis.

ASSESSMENT Often used as a synonym for evaluation, but
sometimes used to refer to a process that is more focussed on
quantitative and/or testing approaches; the quantity may be
money (as in real estate assessment), or numbers and scores
(as in National Assessment of Educational Progress). People
sometimes suggest that assessment is less of a judgmental and
more of a measurement process than certain other kinds of
evaluation; but it might be argued that it is simply a case
of evaluation in which the judgment is built into the
numerical results. Raw scores on a test of no known content
or construct validity would not be assessment; it is only
when the test is (supposedly) of basic mathematical
competence, for example, that reporting the results
constitutes assessment in the appropriate sense, and of
course the judgment of validity is the key evaluative
component in this.

ATTENUATION (Stat.) In the technical sense this refers to
the reduction in correlation due to errors of measurement.
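
The entry gives no formula. As a minimal sketch, assume the
classical (Spearman) relation, in which the observed
correlation equals the true correlation multiplied by the
square root of the product of the two measures' reliabilities.
The function names below are illustrative only.

```python
import math


def attenuated_r(true_r, reliability_x, reliability_y):
    """Correlation one would expect to observe, given the true
    correlation and the reliability of each measure (0 to 1)."""
    return true_r * math.sqrt(reliability_x * reliability_y)


def disattenuated_r(observed_r, reliability_x, reliability_y):
    """Classical correction for attenuation: estimate of the
    correlation between error-free versions of the measures."""
    return observed_r / math.sqrt(reliability_x * reliability_y)


# Invented figures: a true correlation of 0.50 measured with
# reliabilities of 0.81 and 0.64 is observed as roughly 0.36.
observed = attenuated_r(0.50, 0.81, 0.64)
```

The less reliable the measures, the more an observed
correlation understates the underlying relationship.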

ATTITUDE, EVALUATIVE See Evaluation Skills. 

ATTITUDES The compound of cognitive and affective variables
describing a person's mental set towards another person,
thing, or state. It may be evaluative or simply preferential;
that is, someone may think that running is good for you, or
simply enjoy it, or both; enjoying it does not entail
thinking it is meritorious, nor vice versa, contrary to many
suggested analyses of attitudes. Attitudes are inferred from
behavior, including speech behavior, and inner states. No
one, including the person whose attitudes we are trying to
determine, is infallible about attitudinal conclusions, even
though that person is in a nearly infallible position with
respect to his or her own inner states, which are not the
same as attitudes. Notice that there is no sharp line between
attitudes and cognition; many attitudes are evinced through
beliefs (which may be true or false), and attitudes can
sometimes be evaluated as right or wrong, or good or bad, in
an objective way (e.g. attitudes towards "the world owing one
a living," work, women (men), etc.). See Affective.

ATTRITION The loss of subjects in the experimental or
control/comparison groups during the period of the study.
This is often so large as to destroy the experimental
design -- 60% loss within a year is not uncommon in the
schools. Hence all choice of numbers in the groups must be
based upon a good estimate of attrition plus a substantial
margin for error.
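
The arithmetic behind that advice can be sketched as follows.
This is a minimal illustration, assuming attrition is
expressed as a simple proportion of the starting group and
that the margin for error is added to that proportion; the
function and parameter names are hypothetical.

```python
import math


def initial_group_size(needed_at_end, attrition_rate, safety_margin=0.0):
    """Number of subjects to enroll so that about `needed_at_end`
    remain, planning for attrition_rate + safety_margin losses."""
    planning_rate = attrition_rate + safety_margin
    if not 0 <= planning_rate < 1:
        raise ValueError("planned loss must be at least 0 and below 1")
    return math.ceil(needed_at_end / (1 - planning_rate))


# With the 60% yearly loss mentioned above and an extra 10 points
# of margin, ending with 100 subjects means enrolling 334.
start_size = initial_group_size(100, 0.60, safety_margin=0.10)
```

The same calculation applies separately to the experimental
and the control/comparison groups, which may lose subjects at
different rates.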

AUDIENCE (in Robert Stake's sense) A group, whether or not
they are the client(s), who will or should see and may use or
react to an evaluation. Typically there are several
audiences, and typically an evaluation report or presentation
will need careful planning in order to serve the several
audiences reasonably well.

AUDIT, AUDITOR Apart from the original sense of this term,
which refers to a check on the books of an institution by an
independent accountant, the evaluation use of the term refers
to a third party evaluation or external evaluation, often of
an evaluation. Hence -- and this is the standard usage in
California -- an auditor may be a meta-evaluator, typically
serving in a formative and summative role. In the more
general usage, an auditor may be simply an external evaluator
working either for the same client as the primary evaluator
or for another client. There are other occasions when the
auditor is halfway between the original kind of auditor and
an evaluation auditor; for example, the Audit Agency of HEW
(now HHS/ED) was originally set up to monitor compliance with
fiscal guidelines, but their staff are now frequently looking
at the methodology and overall utility of evaluations. The
same is true of GAO and OMB "audits."

BALANCE OF POWER A desirable feature of the social
environment of an evaluation, summed up in the formula: "The
power relation of evaluator, evaluee and client should be as
nearly symmetrical as possible." For example, evaluees should
have the right to have their reactions to the evaluation and
evaluator(s) appended to the report when it goes to the
client. Similarly, the client should also undertake to be
evaluated in the typical situation where the contract
identifies someone else as the evaluee. (School
administrators who are not being properly evaluated have
little right to have teachers critically evaluated.)
Meta-evaluation and goal-free evaluation are both part of the
Balance of Power concept. Panels used in evaluation should
exhibit a balance of power, not a lack of bias as it is
conventionally perceived. There are both ethical and
political/practical reasons for arranging a balance of power.

BASELINE (data or measures) Facts about the condition or
performance of subjects prior to treatment. The essential
result of the pretest part of the pretest-posttest approach.
Gathering baseline data is one of the key reasons for
starting an evaluation before a program starts, something
that always seems odd to budgetary bureaucrats. See
Preformative.

BASIC CHECKLIST The multi-point checklist for evaluating
products, programs, etc., to be found under Key Evaluation
Checklist.

BEHAVIORAL OBJECTIVES Specific goals of, e.g. a program,
stated in terms which will enable their attainment to be
checked by observation or test/measurement. An idea which is
variously seen as 1984/Skinner/dehumanizing, etc., or as a
minimum requirement for the avoidance of empty verbalisms.
Some people now use "measurable objectives" to avoid the
miasma associated with the connotations of behaviorism. In
general, people are now more tolerant of objectives that are
somewhat more abstractly specified, provided that leading
verification/falsification conditions can be spelled out,
than they were in the early days of the behavioral objectives
movement, in the 1960s. This is because the attempt to spell
everything out (and skip the statement of intermediate-level
goals) produces 7633 behavioral objectives for reading, which
is an incomprehensible mess. Thus educational research has
rediscovered the reason for the failure of the precisely
analogous move by positivist philosophers of science to
eliminate all theoretical terms in favor of observational
terms. The only legitimate scientific requirement here is
that terms have a reliable use and agreed-upon empirical
content, not a short translation into observational
language -- the latter is just one way to the former and not
always possible. Fortunately, scientific training can lead to
the reliable (enough) use of theoretical terms, i.e., they
can be unpacked into the contextually-relevant measurable
indicators upon demand. This avoids the loss of the main
cognitive organizers above the taxonomical level, and hence
of all understanding, that would
result from the total translation project, even if it were
possible. The same conclusion applies to the use of somewhat
general goal statements.

BIAS A condition in an evaluation or other design, or in
some of its participants, that is likely to produce errors;
for example, a sample of the students enrolled in a school is
biased against lower economic groups if it is selected from
those present on a particular day since absenteeism rates are
usually higher amongst lower economic groups. Hence, if we
are investigating an effect that may be related to economic
class, using such a sample would be faulty design. It is
common and incorrect to suppose that (strong) preferences are
biases, e.g. someone who holds strong views against the use
of busing to achieve desegregation is often said to be
biased. (See the glossary of Evaluation Standards, McGraw
Hill, 1980; where bias is wrongly defined as "a consistent
alignment with one point of view.") This is true only where
the views are unjustified, i.e., involve or will probably
lead to errors. It is not true if the views are merely
controversial; one would scarcely argue that believers in
atoms are biased, even though the existence of atoms is
denied by Christian Scientists. One sometimes needs a judge
in a dispute that is neutral or acceptable to all parties or
to the audiences; this should be distinguished from unbiased.
Being neutral is often a sign of error in a given dispute
i.e. a sign of bias. Evaluation panels should usually include
trained and knowledgeable people with strong commitments both
for and against whatever approach, program, etc., is being
evaluated (where such factions exist) and no attempt should
be made to select only neutral panelists at the usual cost of
selecting ignoramuses or cowards and getting superficial,
easily dismissed reports. The neutral faction, if equally
knowledgeable, should be represented just as any other
faction. Selecting a neutral chair may be good psychology or
politics (and that is part of good evaluation design, too),
but not because s/he is any more likely to be a good judge.
See also Shared Bias, Selectivity Bias.

BIAS CONTROL A key part of evaluation design; it is not an
attempt to exclude the influence of definite views but of
unjustified e.g. premature or irrelevant views. For example,
the use of (some) external evaluators is a part of good
bias control, not because it will eliminate the choice of
people with definite views about the type of program being
evaluated, but because it tends to eliminate people who are
likely to favor it for the irrelevant (and hence
error-conducive) reasons of ego-involvement or
income-preservation (cf. also Halo Effect). Usually, however,
program managers avoid the use of an external evaluator with
a known negative view of programs like theirs, even for
formative evaluation, which is to confuse bias with
preference. Enemies are one of the best sources of useful
criticism; it's irrelevant that one doesn't enjoy it. Even if
it is politically necessary to take account of a manager's
opposition to the use of a negatively-disposed summative
evaluator, it should be done by adding a second evaluator,
also knowledgeable, to whom there is no objection, not by
looking for someone neutral as such, since neutrality is just
as likely to be biased and more likely to be based on
ignorance; a key point. The general principle of bias control
illustrated here is the principle of balancing (possible)
bias in a group of evaluators rather than selecting only
"unbiased" evaluators, which is usually and wrongly
interpreted as meaning uncommitted, i.e. (all-too-often)
ignorant or cowardly evaluators. (First, of course, one
screens out everyone whose views are plainly biased, i.e.,
unjustified.) Other key aspects of bias control involve
further separation of the rewards channel from the evaluation
reporting, designing or hiring channel, e.g. by never
allowing the agency monitor for a program to be the monitor
for the evaluation contract on that program, never allowing a
program contractor to be responsible for letting the contract
to evaluate that program, etc. The ultimate bias of
contracted evaluations resides in the fact that the agencies
which fund programs fund most or all of their evaluations,
hence want favorable ones, a fact of which evaluation
contractors are (usually consciously) aware and which does a
great deal to explain the vast preponderance of favorable
evaluations in a world of rather poor programs. Even GAO,
although effectively beyond this influence for most purposes,
is not immune enough for Congress to regard them as totally
credible, hence -- in part -- the creation of the CBO
(Congressional Budget Office). The possible merits of an
evaluation "judiciary," isolated from most pressures by
life-time appointment, deserve consideration. Another
principle of bias control reminds us of the instability of
independence or externality -- today's external evaluator is
tomorrow's co-author (or spurned contributor). For more
details, see "Evaluation Bias and Its Control," in Evaluation
Studies Review Annual (Vol. 1, 1976, ed. G. Glass, Sage).
The possibility of neat solutions to bias control design
problems is kept alive in the face of the above adversities
by remembering the Pie-Slicing Principle: "You slice and I'll
select." See also Local Experts.

BIG SHOPS The "big shops" in evaluation are the five to ten
that carry most of the large evaluation contracts; they
include Abt Associates, AIR, ETS, RAND, SDC, SRI, etc. (for
translations see the acronym appendix). The tradeoffs between
the big shops and the small shops run something like this,
assuming for the moment that you can afford either: the big
shops have enormous resources of every kind, from personnel
to computers; they have an ongoing stability that pretty well
ensures the job will be done with at least a minimum of
competence; and their reputation is important enough to them
that they are likely to meet deadlines and do other good
things of a paper-churning kind like producing nicely bound
reports, staying within budget and so on. In all of these
respects they are a better bet, often a much better bet, than
the small shops. On the other hand, you don't know who you
are going to get to work for you in a big shop, because they
have to move their project managers around as the press of
business ebbs and flows, and as their people move on to other
positions; they are rather more hidebound by their own
bureaucratic procedures than a small shop; and they are
likely to be a good deal more expensive for the same amount
of work, because they are carrying a large staff through the
intervals between jobs which are inevitable, no matter how
well they are run. A small shop is often carrying a
proportionately smaller overhead during those times because
its principals have other jobs, and may be working out of a
more modest establishment, the staff taking some of their
payments in the pleasures of independence. It's much easier
to get a satisfactory estimate of competence about the large
shops than it is about the small shops; but of course what
you do learn about the personnel of a small shop is more
likely to apply to the people that do your work. There's an
essential place for
both of them; small shops simply can't manage the big
projects competently, although they sometimes try; and the
big shops simply can't afford to handle the small contracts.
If some more serious evaluation of the quality of the work
done was involved in government review panels -- and the
increasing strength of GAO in meta-evaluation gives some
promise of this -- then small shops might fit better into the
scheme of things, rather as they do in the management
consulting field and in the medical specialties. We are
buying a lot of mediocre work for our tax dollar at the
moment, because the system of rewards and punishments is set
up to punish people that don't deliver (or get delivered) a
report on time; but not to reward those who produce an
outstanding report by comparison with a mediocre one.

BI-MODAL (Stat.) See Mode.

BLACK BOX EVALUATION A term, usually employed pejoratively,
that refers to holistic summative evaluation, in which an
overall and frequently brief evaluation is provided, without
any suggestions for improvements, etc. Black box evaluation
is frequently extremely valuable (e.g. a consumer product
evaluation); is frequently far more valid than any analytical
evaluation that could be done within the same time line and
for the same budget; and has the great advantage of brevity.
But there are many contexts in which it simply will not
provide the needed information e.g. where analytical
formative evaluation is required. (Note that black box
evaluation may even be extremely useful in the formative
situation.) Cf. Engineering Model.

BOILERPLATE Stock paragraphs or sections that are dumped
into RFPs or reports (e.g. from storage in a word-processor)
to fill them out or fulfill legal requirements. RFPs from
some agencies are 90 percent boilerplate -- one can scarcely
find the specific material in them.

BUDGET Regardless of the form which particular agencies
prefer, it's desirable to develop a procedure for project
budgeting that remains constant across projects so that your
own staff can become familiar with the categories, and to
give you a basis for comparison. It can always be converted
into a particular required format if it is thoroughly
understood. The main categories might be direct
labor costs, other direct costs (materials, supplies, etc.),
indirect expenses (space and energy costs), other indirect
costs (administrative expenses or "general and
administrative" expenses (G&A)). The difference between
ordinary overhead and G&A is not sharp, but the idea is that
ordinary overhead should be those costs that are incurred at
a rate proportional to staff salaries on the project, this
proportion being the overhead rate, e.g. retirement,
insurance, etc. G&A will include indirect costs not directly
related to project or staff size (for example, license fees
and profit). A number of indirect costs such as accounting
services, interest charges, etc., could be justifiably put
under either category. See Costs.
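
A constant budgeting procedure of the kind recommended here
might be sketched as follows. This is a minimal illustration,
assuming overhead is charged as a fixed rate on direct labor
and G&A is entered as a flat amount; the function name, the
parameter names and the figures are all hypothetical.

```python
def project_budget(direct_labor, other_direct, overhead_rate, g_and_a):
    """Build a budget from the main categories named in the entry.

    Overhead is taken as proportional to direct labor (staff
    salaries); G&A is a flat amount not tied to project size.
    """
    overhead = direct_labor * overhead_rate
    total = direct_labor + other_direct + overhead + g_and_a
    return {
        "direct_labor": direct_labor,
        "other_direct": other_direct,
        "overhead": overhead,
        "g_and_a": g_and_a,
        "total": total,
    }


# Invented figures: 50,000 in salaries, 8,000 in materials,
# a 25% overhead rate and 6,000 of G&A give a total of 76,500.
budget = project_budget(50_000, 8_000, 0.25, 6_000)
```

Keeping the categories fixed in one function makes budgets
comparable across projects; converting to an agency's required
format then becomes a matter of relabeling the same figures.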

CAI Computer Assisted Instruction. Computer presents the
course material or at least the tests on it. Cf. CMI.

CALIBRATION Conventionally refers to the process of
matching the readings of an instrument against a prior
standard. In evaluation would include identification of the
correct cutting scores (which define the grades) on a new
version of a test, traditionally done by administering the
old and the new test to the same group of students (half
getting the old one first, half the new). A less common but
equally important use is with respect to the standardization
of judges who are on e.g. a site-visit or proposal-reviewing
panel. They should always be run through two or three
calibration examples, specially constructed to illustrate:
(a) a wide range of merit; (b) common difficulties e.g. (in
proposal evaluation) comparing low probability of a big
pay-off with high probability of a modest pay-off. While it
is not crucial to get everyone to give the same rating
(interjudge reliability), indeed pushing for it per se
decreases validity, it is highly desirable to avoid: (a)
gross intra-judge inconsistency; (b) extreme compression of
an individual's ratings, e.g. at the top, bottom or middle,
unless the implications and alternatives are thoroughly
understood; (c) drift of each judge's standards as they
"learn on the job" (let them sort out their standards on the
calibration examples); (d) the intrusion of the panel's
possibly turbulent group dynamics into the first few ratings
(let it stabilize during the calibration period). While the
time-cost of calibration may appear to be serious, in fact it
is not, if the development of suitable scales
and anchor points is undertaken when doing the calibration
examples, since the use of these (plus e.g. salience scoring)
greatly increases speed. And, if anyone really cares about
validity, or interpanel reliability (i.e., justice), it is
an essential step. See also Anchoring.
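
The first sense of calibration (finding cutting scores on a
new version of a test) can be illustrated with a short
sketch. The entry does not say how the two sets of scores are
matched; the version below assumes one simple option, choosing
the new cut so that the same number of students in the common
group reach it as reached the old cut. It ignores the order in
which the tests were taken, and the function name and scores
are hypothetical.

```python
def equivalent_cut(old_scores, new_scores, old_cut):
    """Return the new-test score reached by as many students as
    reached `old_cut` on the old test, or None if no one did.

    Both score lists come from the same group of students.
    With tied scores the new cut may admit slightly more people.
    """
    n_reaching = sum(1 for s in old_scores if s >= old_cut)
    if n_reaching == 0:
        return None
    ranked = sorted(new_scores, reverse=True)
    return ranked[n_reaching - 1]


# Invented scores for eight students on both versions.
old = [55, 62, 70, 71, 80, 85, 90, 93]
new = [48, 58, 63, 66, 74, 79, 86, 88]
new_cut = equivalent_cut(old, new, old_cut=70)  # 63
```

Six students reach 70 on the old version, so the new cut is
set at the sixth-highest new score; a harder new version thus
gets a lower cutting score for the same grade.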

CASE-STUDY METHOD The case-study method is at the opposite
end of the spectrum of methods from the survey method. Both
may involve intensive or casual testing and/or interviewing;
observing, on the other hand, is more characteristic of case
study method than of large-scale surveys. The case study
approach is typical of the clinician, as opposed to the
pollster; it is nearer to the historian and anthropologist
than it is to the demographer. Causation is usually
determined in case studies by the modus operandi method,
rather than by comparison of an experimental with a control
group, although one could in principle do a comparison case
study of a matched case. The case study approach is
frequently used as an excuse for substituting rich detail for
evaluative conclusions, a risk inherent in responsive
evaluation, transactional evaluation and illuminative
evaluation. At its best, a case study can uncover causation
where no statistical analysis could; and can block or suggest
interpretations that are far deeper than survey data can
reveal. On the other hand, the patterns that emerge from
properly done large-scale quantitative research cannot be
detected in case studies, and the two are thus naturally
complementary processes for a complete investigation of e.g.
the health or law enforcement services in a city. Note that
quantitative methods can often be applied on an intra-case
basis. One gets an adequate n either from multiple responses
(Skinner) or multiple independently validated measures
(Campbell). See also Naturalistic.

CAUSATION The relation between mosquitos and mosquito
bites. Easily understood by both parties but never
satisfactorily defined by philosophers or scientists.

CEILING EFFECT The result of scoring near the top of a
scale -- which makes it harder (even impossible) to improve as
easily as from a point further down. Sometimes described as
"lack of headroom." Scales on which raters score almost
everyone near the top will consequently provide little
opportunity for anyone to distinguish themselves by
outstanding (comparative) performance. In the language of
the stock market, they (the scales plus the raters) provide
"all downside risk." (Typical of teacher evaluation forms.)
Usually they (or the way they are used) should be
reconstructed to avoid this; but not if they correctly
represent the relevant range of the rated variable, since
then the "upside" differences would simply be a measurement
artefact. After all, if all the students get all the answers
right, there shouldn't be any headroom above their grade on
your scale. (You might want to use a different test, however,
if your task was to get a ranking.)

CENTRAL TENDENCY, (Measure of) (Stat.) The misleading
technical term for a statistic that describes the middle or
average of a distribution, as opposed to the extent to which
it is spread thin, or lumped, the latter being the dispersion
or variability of the distribution.

CERTIFICATION A term like credentialing; which re- 
fers to the award of some official recognition of status, 
typically based on a serious or trivial evaluation process. 
Accreditation is another cognomen. The certification of 
evaluators has recently been discussed rather extensively, 
and raises a number or the usual problems; whp is going to 
be the super evaluator(s) who decide(s) on the rules of the 
game (or who lost), what would be the enforcement proce- 
dures, how would the cost be handled, etc. Certification is a 
two-faced process which is sometime* represented as a 
consumer-protection deVice— which it can be- -and some- 
times as a turf-protection device for the guild members, i.e. 
a restraint of trade process, which it frequently is. Medical 
certification was responsible for driving out the midwives, 
probably at a substantial cost to the consumer; on the other
hand, it was also responsible for keeping a large number of 
complete charlatans from exploiting the public. It certainly 
contributed to the indefensible magnitude of physicians' 
and lawyers' salaries/fees; and in this respect is consumer- 
exploitative. The abuses of the big-league auditors, to take 
another example, are well-documented in Unaccountable 
Accounting by Abraham Briloff (1973). When the state gets 
into the act, as it does with the certification of psychologists 
in many states, and of teachers in most, various political 
abuses are added to the above. In areas such as architecture,
where non-certificated and certificated designers of domestic
structures compete against each other, one can see some
advantages to both approaches; but there is very little
evidence supporting a single overall conclusion as to the
direction which is best for the citizenry, or even for the whole
group of practitioners. A well set up certification approach
would undoubtedly be the best; the catch is always in the
political compromises involved in setting it up; in other
countries, the process is sometimes handled better and
sometimes worse, depending upon variations in the political
process.

CERTIFICATION OF EVALUATORS See Evaluation 
Registry 

CHECKLIST APPROACH (to evaluation) A checklist 
identifies all significant relevant dimensions of value, ide- 
ally in measurable terms, and may also provide for weight- 
ing them according to importance. (It may also refer to or 
only to components.) The checklist provides an extremely
versatile instrument for determining quality of all kinds
of educational activities and products. The checklist
approach reduces the probability of omitting a crucial factor. It
reduces artificial overweighting of certain factors by careful 
definition of the checklist items, so as to avoid overlap 
(sometimes undesirable e.g. when it results in much less 
comprehensible dimensions). It also provides a guideline for 
investigating the thoroughness of implementation proce- 
dures and it reduces possible halo effect and Rorschach 
effect. It does not require a theory and should avoid de- 
pending on one as much as possible. Checkpoints — if there 
are many — should be grouped under categories that have 
commonsense or obvious meaning, to facilitate interpretation.
A checklist does not usually embody the appropriate
combinatorial procedure for cases where the dimensions
are highly interactive, i.e. where the linear or weighted-sum
approach fails: such cases are rare. Checklists may list
desiderata or necessitata. The former accrue points, the latter
represent minimum necessary standards. (One checkpoint
(dimension) may involve both.) It's advisable to asterisk all
absolute requirements and check them first, to avoid
wasted time. See Weight & Sum.
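
The weighted-sum procedure described in this entry can be sketched as follows. This is only an illustration under stated assumptions: the Checkpoint class, the evaluate function, and the sample textbook checklist with its weights and ratings are all hypothetical, not part of any published checklist.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Checkpoint:
    name: str
    weight: float = 0.0              # importance weight (desideratum points)
    minimum: Optional[float] = None  # necessitatum: minimum acceptable rating

def evaluate(checkpoints: List[Checkpoint],
             ratings: Dict[str, float]) -> Optional[float]:
    """Return the weighted sum, or None if any absolute requirement fails."""
    # Check the absolute requirements first, to avoid wasted time.
    for cp in checkpoints:
        if cp.minimum is not None and ratings[cp.name] < cp.minimum:
            return None
    # Linear weighted-sum combination of the rated dimensions.
    return sum(cp.weight * ratings[cp.name] for cp in checkpoints)

# Hypothetical checklist for a textbook; names, weights, ratings are invented.
checklist = [
    Checkpoint("accuracy", weight=3.0, minimum=2.0),  # desideratum and necessitatum
    Checkpoint("readability", weight=2.0),
    Checkpoint("cost", weight=1.0),
]

good = evaluate(checklist, {"accuracy": 4, "readability": 3, "cost": 2})
rejected = evaluate(checklist, {"accuracy": 1, "readability": 5, "cost": 5})
```

The second candidate is rejected outright despite high ratings elsewhere, because it fails the one asterisked requirement. Note that the sketch assumes the dimensions do not interact; where they do, a plain sum like this is the wrong combinatorial procedure.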

CIPP An evaluation model expounded in Educational
Evaluation and Decision-Making by Guba, Stufflebeam et
al.; the acronym refers to Context, Input, Process and Product
evaluation, the four phases of evaluation they distinguish;
it should be noted that these terms are used in a
slightly special way. Possibly the most elaborate and carefully
thought out model extant; it underemphasized evaluation
for accountability or for scientific interest.

CITATION INDEX The number of times that a publication
or person is referenced in other publications. If used
for personnel evaluation, this is an example of a spurious
quantitative measure of merit since e.g. it depends on the
size of the field, discriminates against the young, against
those working on unfashionable topics, does not in fact
identify a third of the Nobel laureates etc. Can be used for
awarding a few bonus points, but only if there are other
ways to reach them e.g. indicators of pathbreaking in new
fields. Most plausible use is in evaluating the significance of a
particular publication within a field, i.e. in history of ideas
research; significance in this sense is very loosely related to
merit.

CLIENT The person (or agency, etc.) for whom an evaluation
is formally done. Usually to be distinguished from
audience and consumer. In social program evaluation, the
term "client" is often used to mean the "consumer," i.e.,
the client of the program rather than of the evaluator; it is
better to try to use the term "clientele" for that purpose. 

CLIENTELE The population directly served by a 
program. 

CLINICAL EVALUATION See Psychological Evaluation.

CLINICAL PERFORMANCE EVALUATION In the
health field, and to an increasing extent elsewhere (e.g.
teaching evaluation), the term "clinical" is being used to
stress a kind of "hands-on" situation which is typically not
well tested by anything like paper and pencil tests. However,
it can be very well tested by appropriate simulations,
as we have seen in some of the medical boards exams. It can
also be very well tested by carefully done structured
observations by trained and calibrated observers. If one thinks of
a paper and pencil test as a limiting case of a simulation, one
realizes the enormous extent to which it depends, in order
to be realistic, upon imagination and role-playing skills that
few of us possess. When one turns to look at standard
simulations, one finds that these have inherited a great deal
of the artificiality of the paper and pencil tests. For example,
they rarely involve "parallel processing," that is, the necessity
of handling two or three tasks simultaneously. A serious
clinical simulation would start the candidate on one
problem, providing charts and histories, and then, just as
this was beginning to make sense, a new problem with
emergency overtones would be thrust at them, and just
before they reached the point of making a preliminary
emergency decision on that, a third and even more pressing
problem would be thrown at them. Given that there is some
anxiety associated with test-taking for most people, one
could probably come close to simulating clinical settings in
this respect. We have long since developed simulations
which involve the provision of supplementary information
when requested by the testee, part of the scoring being tied
to the making of appropriate requests. But very few signs of
careful job analysis show up in more advanced simulations
where a true clinical performance is of interest.

CMI Computer Managed Instruction. Records are kept 
by the computer, usually on every test item and every 
student's performance to date. Important for large-scale 
individualized instruction. Computer may do diagnosis on 
basis of test results and instruct student as to materials that 
should be used next. Extent of feedback to student varies 
considerably; main aim is feedback to course manager(s).

COGNITIVE The domain of the propositionally know- 
able; consisting of "knowledge-that," or "knowledge- 
how" to perform intellectual tasks. 

COHORT A term used to designate one group among 
many in a study, e.g. "the first cohort" may be the first 
group to have been through the training program being 
evaluated. Cf. Echelon 

COMPETENCY-BASED An approach to teaching or
training which focuses on identifying the competencies
needed by the trainee, and on teaching to mastery level on
these, rather than teaching allegedly relevant academic
subjects to various subjectively determined achievement levels.
Nice idea, but most attempts at it either fail to specify the
mastery level in clearly identifiable terms or fail to show
why that level should be regarded as the mastery level.
("Performance-based" is a cognomen.) C-B Teacher Education
(CBTE) was a big deal in the mid-70s but the catch was that
no one could validate the competencies since style research
has come up with so little. There is always the subject-matter
competency requirement, of course, usually ignored
in K-12 teacher training and treated as the only one in the
post-secondary domain; but CBTE was talking about pedagogical
competencies (teaching method skills). See also
Minimum Competency, Mastery.

COMPLIANCE CHECK, COMPLIANCE REVIEW
An aspect of monitoring. 

COMPONENT EVALUATION A component of an
evaluand is typically a physically discrete part of it but
more precisely any segment that can be said to relate to
others in order to make up the whole evaluand. (Typically,
we distinguish between the components and their relationships
in talking about the evaluand as a system made up of
parts [...]
does not involve any evaluation of its components; and an
evaluation of components does not automatically imply an
evaluation of the whole evaluand: excellent components
for an amplifier will not make a good amplifier unless they
are correctly related by design and assembly relationships.
But since components are frequently of variable quality,
and since we are frequently looking for diagnoses that will
lead to improvement, evaluating the components may be a
very useful approach to formative evaluation. If we can also
evaluate the relationships, we may have a very helpful kind
of (especially) formative evaluation; how helpful will depend
upon whether the "fixes" for defective components
are self-evident or easily determined. Component evaluation
is distinguished from dimensional evaluation, another
kind of analytical evaluation, by the relatively greater
likelihood of manipulability, in a constructive way, of components
by comparison with dimensions (which may be
statistical artefacts). And evaluands with no components
may have dimensions e.g. a vase.

CONCEPTUAL SCHEME A set of concepts in terms of
which one can organize and in a minimal sense understand
the data/results/observations/evaluations in an area of
investigation. Unlike theories, conceptual schemes involve
no assertions or generalizations (other than the minute
presuppositions of referential constancy), but they do generate
hypotheses and descriptive simplicity.

CONCLUSION-ORIENTED RESEARCH Contrasted
with decision-oriented. Cronbach and Suppes' distinction
between two types of educational research, sometimes
thought to illuminate the difference between evaluation
research (supposedly decision-oriented) and academic social
science research (conclusion-oriented). This view is
based on the fallacy of supposing that conclusions about
merit and value aren't conclusions, a holdover from the
positivist, value-free doctrine that value-judgments are not
testable propositions, hence unscientific; and on the fallacy
of supposing that all evaluation relates to some decision (the
evaluation of many historical phenomena e.g. a reign or a
policy does not).

CONCURRENT VALIDITY The validity of an instru- 
ment which is supposed to inform us about the simultaneous 
state of another system or variable. Cf. predictive validity,
construct validity.

CONFIDENTIALITY One of the requirements that
surfaces under the legitimate process considerations in the 
Key Evaluation Checklist. Confidentiality, as it is presently 
construed, relates to the protection of data about individu- 
als from casual perusal by other individuals, not to the 
protection of evaluative judgments on an individual from 
inspection by that individual. The requirement that indi- 
viduals be able to inspect an evaluative judgment made 
about them, or at least summaries of these with some at- 
tempt at preserving anonymity of the evaluator, is a rela- 
tively recent constraint on personnel evaluation. It is widely 
thought to have undermined the process quite seriously, 
since people can no longer say what they think of the 
candidate if they have any worry about the possibility of the 
candidate inferring their authorship and taking reprisals or 
thinking badly of them (if the evaluation was critical). It 
should be noted that most large systems of personnel evaluation
have long since failed because people were unwilling
to do this even when complete anonymity was guaranteed.
This was generally true of the armed services systems and
many state college systems. There is no doubt that even
amongst universities of the first rank there has been a negative
effect; but this mostly shows a failure of ingenuity on
the part of personnel evaluators, since there are several
ways to preserve complete anonymity, under even the
weakest laws, namely those which only blank out the name
and title of the evaluator. See also Anonymity.

CONFLICT OF INTEREST (COI) One of many sources
of bias. An evaluator evaluating his/her own products is
involved in a conflict of interest, but the result may still be
better than the evaluation done by an external evaluator
since the latter's loss of intimate knowledge of and experience
with the product and with evaluation rigor and methodology
may not compensate for lack of ego-involvement.
That is, although conflict of interest always hurts credibility,
it does not always affect validity. But since it may easily affect
validity, it is normally better to use at least a mixture of
internal and external evaluation. In choosing panels for
evaluation, the effort to pick parties who have no conflict
of interest is usually misplaced or excessive; it is better to
choose a panel with a mix (not even an exact balance) of
conflicting interests, since they are likely to know more
about the area than those with no interests in it or against it.
Financial, personal and social ties are no different from
intellectual commitment with respect to COI; all can produce
better insights as well as worse judgments. The key to
managing COI is requiring that the arguments be public and
that their validity be scrutinized and voted on by those with
other or no relevant COI. See Bias.

CONNOISSEURSHIP MODEL Elliott Eisner's non-traditional
method of evaluation is based on the premise
that artistic and humanistic considerations are more important
in evaluation than scientific ones. No quantitative
analysis is used but instead the connoisseur-evaluator observes
firsthand the program or product being evaluated.
The final report is a detailed descriptive narrative about the
subject of the evaluation. Cf. Literary criticism, Aesthetic
evaluation, Naturalistic, Responsive and Models.

CONSONANCE/DISSONANCE The phenomena of
cognitive consonance and dissonance, often associated
with the work of the social scientist Leon Festinger, are a
major and usually underrated threat to the validity of client
satisfaction surveys and follow-up interviews as guides to
program or product merit. (The limiting case is the tendency
to accept Presidential decisions.) Cognitive consonance, not
unrelated to the older notion of rationalization, occurs
when the subject's perception of the merit of X is changed
by his or her having made a strong commitment to X, e.g. by
purchasing it, spending time taking it as therapy, etc. Thus
a Ford Pinto may be rated as considerably better than a VW
Rabbit after it has been purchased than before, although no
new evidence has emerged which justified this evaluation
shift. This is the conflict of interest side of the coin whose
other side is increased knowledge of (e.g.) the product.
Some approaches to discounting this phenomenon include
very careful separation of needs assessment from performance
assessment, the selection of subjects having experience
with both (or several) options, serious task-analysis by
the same trained observers, looking at recent purchasers of
both cars, etc. The approval of boot camp by Marines and of
cruel initiation rites by fraternity brothers is a striking and
important case, called "initiation-justification" bias in LE.
(These phenomena also apply at the meta-level, yielding
spurious positive evaluations of evaluations by clients.)

CONSTRUCT VALIDITY The validity of an instrument
(e.g. a test or observer) as an indicator of the
presence of (a particular amount of) a theoretical construct.
The construct validity of a thermometer as an indicator of
temperature is high, if it has been correctly calibrated. The
key feature of construct validity is that there can be no
simple test of it, since there is no simple test of the presence
or absence of a theoretical construct. We can only infer to
that presence from the interrelationships between a number
of indicators and a theory which has been indirectly confirmed.
The contrast is with predictive and concurrent validity,
which relate the readings on an instrument to another
directly observable variable. Thus, the predictive validity of
a test for successful graduation from a college, administered
before admission, is visible on graduation day some years
later. But the use of a thermometer to test temperature
cannot be confirmed by looking at the temperature; in fact,
the thermometer is as near as we ever get to the temperature.
Over the history of thermodynamics, we have adopted
four successive different theoretical definitions of temperature,
although you couldn't tell this from looking at thermometers.
Thus, what the thermometer has "read" has
been four different theoretical constructs and its validity as
an indicator of one of these is not at all the same as its
validity as an indicator of another. No thermometer reads
anything at all in the region immediately above absolute
zero, since all gases and liquids have solidified by that point;
nevertheless, this is a temperature range; and we infer what
the temperature is, there, by complicated theoretical calculations
from other variables. The validity of almost all tests
used for evaluative purposes is construct validity, because
the construct towards which they point (e.g. "excellent
computational skills") is a complex construct and not observable
in itself. This follows from the very nature of evaluation
as involving a synthesis of several performance scales.
But of course it does not follow that evaluative conclusions
are essentially less reliable than those from tests with demonstrated
predictive validity, since predictive validities
are entirely dependent upon the persistence through time
(often long periods of time) of a relationship, a dependency
which is often shakier than the inference to an intellectual
skill such as computational excellence from a
series of observations of a very talented student faced with
an array of previously unseen computational tasks. Thermometers
are highly accurate though they "only" have
construct validity. Construct validity is rather more easily
attainable with respect to constructs which figure in a conceptual
scheme that does not involve a theory; only the
requirements of taxonomical merit (clarity, comprehensiveness,
insight, fertility etc.) need to be met, not confirmation
of the axioms and laws of the theory. (Such constructs are
still called "theoretical constructs," perhaps because conceptual
schemes shade and evolve into theories so fluidly.)

CONSULTANT Consultants are not simply people
hired for advice on a short-term basis, as one might suppose
from the term; they include a number of people who are
essentially regular (but not tenured) staff members of state
agencies, where some budgetary or bureaucratic restriction
prevents the addition of permanent staff, but allows a
semi-permanent status to the consultant. Hence an evaluation
consultant is not always an external evaluator. The basic 
problem about being an evaluation consultant, as a career, 
is that— with the exception of the semi-permanent jobs just 
mentioned — you have to make enough on the days you're 
working to carry you through the days when you're not, 
and in the real world it is highly unlikely that jobs will be 
kind enough to fill your time exactly. Meanwhile, some of 
your overhead e.g. secretarial (usually) and rent, will con- 
tinue, as well as your grocery bills, etc. To net $25,000 you 
need to make $40,000, which is only $20/hr. if you were paid
on a salary basis, but requires $30/hr. for working time to 
cover the blank periods plus professional meetings, reading, 
consultations, etc., i.e. $240/day. But the current agency 
maximum is about $150 in HHS/ED, i.e. they want you to 
work for around $16,000/year without tenure or fringe be- 
nefits. It is not surprising that the only feasible, as well as 
the most cost-effective consultants from the client's point of 
view have to be people with full-time jobs who do their 
consulting as moonlighting. In this way, the universities
subsidize the government as well as vice versa. In the management
consultant field, where fees are very much higher
than in the evaluation consultant field — though not as high 
as a regular attorney's fees— this is less of a problem; but in 
the human services program evaluation area, the true cost 
of first-rate consultants is far beyond the budgetary limits 
placed on consulting fees by agencies. Some system of 
payment by results should be allowed as an alternative, so 
that there would be some incentive for fast and extremely 
good work by full-timers, instead of spreading the work out 
and moonlighting it. There are small job-contracts but few, 
under attack, and decreasing. The big shops have some 
full-time evaluators on staff, but only for big projects 
funded by agencies, not as consultants for the average small 
client. 
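
The arithmetic in this entry can be checked with a short calculation. The dollar figures are those quoted in the entry; the 2,000-hour salaried year and the 8-hour day are assumptions (they are what the $20/hr. and $240/day figures appear to imply), and scaling net income by the same net-to-gross ratio is only one plausible reading of how the figure of around $16,000 was reached.

```python
# Figures quoted in the entry.
NET_TARGET = 25_000        # desired net income, $/year
GROSS_NEEDED = 40_000      # gross needed to net that, $/year
WORKING_RATE = 30          # $/hour needed for time actually worked
AGENCY_DAY_MAX = 150       # agency maximum, $/day

# Assumptions not stated in the entry: a 2,000-hour salaried year
# (50 weeks x 40 hours) and an 8-hour billing day.
SALARIED_HOURS = 2_000
HOURS_PER_DAY = 8

salaried_rate = GROSS_NEEDED / SALARIED_HOURS          # $/hour on salary
required_day_rate = WORKING_RATE * HOURS_PER_DAY       # $/day when working
billable_days = GROSS_NEEDED / required_day_rate       # days/year actually paid
gross_at_cap = billable_days * AGENCY_DAY_MAX          # gross at the agency maximum
net_at_cap = gross_at_cap * NET_TARGET / GROSS_NEEDED  # same net-to-gross ratio
```

On these assumptions the consultant is paid for about 167 days a year, and at the agency maximum the net comes to a little under $16,000, which is consistent with the figure given in the entry.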

CONSUMER The "true consumers" of a product, serv- 
ice or program are the persons who are being directly or 
indirectly affected at the using or receiving end of a product 
or program — the most important part of the impacted popu- 
lations. The true consumers are not usually just the target 
population. (The "consumers" of an evaluation are its audiences.)
The staff of a program are also affected by the program,
but at the producing or providing end; we call that
the recoil effect.

CONSUMER-BASED EVALUATION An approach to
the evaluation of (typically) a program that starts with and
focuses on the impact on the consumer or clientele or, to be
more exact, the impacted population. It might or might
not be done goal-free, though clearly that is the methodology
of choice for consumer-based evaluation. It will particularly
focus on the identification of non-target populations
that are impacted, on unintended effects, on true cost to the
consumer etc.

CONTENT ANALYSIS The evaluative or pre-evaluative
process of systematically determining the characteristics
of a body of material or practices, e.g. textbooks,
courses, jobs. A great many techniques have been developed
for doing this, running from frequency counts on
words of certain kinds (e.g. personal references), to analysis
of plot structure in illustrative stories to determine whether
the dominant figure is e.g. male or female, white or non-white.
The use of content analysis is just as important in
determining whether the evaluand matches the "official"
description of it, as it is in determining what it is and what it
does in other dimensions than those involved in the "truth
in packaging" issue. Thus, a social studies chart entitled
"Great Americans" could be subject to content analysis in
order to determine whether those listed were actually great
Americans (truth in labeling); but even if it passed that test,
it would be subject to further content analysis for e.g. sexism,
because a list that did not contain the names of the
great women suffragists would show a deformed sense of
values, although it might be too harsh to argue that it was
not correctly labeled. Notice that none of this refers to a
study of the actual effects (pay-off evaluation), but is a type
of legitimate process evaluation. The line between the two
is not sharp, since literal falsehoods may be the best pedagogical
device for getting the student to remember truths.
Although this approach would then violate the requirement
of scientific or disciplinary integrity (a process consideration),
this would be excused on the grounds that the only
point of the work is to produce the right effects and that
teaching the correct and much more complicated account
leads to less accurate residual learning than teaching an
incorrect account. It is not an exaggeration to say that most
elementary science courses follow the model of teaching
untruths in order to get approximate truths instilled in the
brains of the students. A more radical view would hold that
human brains in general require knowledge to be presented
in the form of rather simple untruths rather than true complexities.
An excellent brief discussion of content analysis by
Sam Ball will be found on pp. 82-84 of the Encyclopedia of
Educational Evaluation which he co-edited for
Jossey-Bass, 1976.
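
The simplest technique mentioned in this entry, a frequency count on words of a certain kind, might be sketched as below. The word list standing in for "personal references," the count_category function and the sample sentence are all hypothetical; a real study would define and defend its own categories.

```python
import re
from collections import Counter

# Hypothetical category: first- and second-person pronouns stand in for
# "personal references"; a real study would define its own word list.
PERSONAL_REFERENCES = {"i", "me", "my", "we", "us", "our", "you", "your"}

def count_category(text, category):
    """Count occurrences of each category word in the text (case-insensitive)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w in category)

sample = "We tested our model. You can check my notes; I kept them for you."
counts = count_category(sample, PERSONAL_REFERENCES)
total = sum(counts.values())
```

A count of this kind says nothing about effects; it is a description of the material, which is why the entry treats content analysis as process evaluation rather than pay-off evaluation.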

CONTENT VALIDITY The property of tests that, after
appropriate content analysis, appear to meet all requirements
for congruence between claimed and actual content.
Thus a test of net-making ability should contain an ade- 
quate (weighted) sampling of all and only those skills which 
the expert net-maker exhibits. Note that this is an example 
of a mainly psychomotor domain of skills; content validity is 
not restricted to the cognitive or verbal areas. Content valid- 
ity is one step more sophisticated than face validity and one 
step less sophisticated than construct validity. So it can be
seen as a more scientific approach to face validity or as a 
less-than-comprehensive approach to construct validity. 
The kind of evaluation that is involved in and leads to 
credentialing by the state as a teacher of e.g. mathematics 
(in the U.S.) is content invalid because of its grotesque
failure to require mathematical skills at anything like a
reasonable level (e.g. same level as the second quartile of
college sophomores majoring in mathematics). In general, 
like other forms of process evaluation, content validity 
checks are considerably quicker than construct validity ap- 
proaches, and frequently provide a rather highly reliable 
negative result, thereby avoiding the necessity for the longer 
investigation. They cannot provide a positive result so 
easily, since content validity is a necessary but not a suffi- 
cient condition for merit. Content validity is a good example 
of a concept developed in one evaluation environment (testing,
i.e. evaluation of students or patients) that transfers
well to another, viz. personnel evaluation (candidates and
employees), once one starts thinking about evaluation as a 
single discipline, logically speaking. 




CONTEXT (of evaluation or evaluand). The ambient 
circumstances that do or may influence the outcome; they 
include attitudes and expectations, not just level of support 
funding, etc. 

CONTRACT See Funding.

CONTRACT TYPES The usual categories of contract
types (this particular classification comes from the Eckman
Center's The Project Manager's Workplan (TPMWP)) are
fixed price, time and materials, cost reimbursement, cost
plus fixed fee, cost plus incentive fee, cost plus sliding fee
and joint powers of agreement. Explaining the differences
beyond those obvious from the terms would be telling you
more than you want to know unless you are about to become
a large-project manager, in which case you'll need
TPMWP, and may be able to afford it (price upward of $30);
it can be ordered from The Eckman Center, P.O. Box 621,
Woodland Hills, CA 91365. That's the technical stuff; but at
the commonsense level, it's a good idea to have something
in writing that covers the basics e.g. when payments are to
be made (and under what conditions they will not be made)
and who is empowered to release the results (and when).
Dan Stufflebeam has the best checklist for this, in his forthcoming
(1982) text: until then, in his monograph in the
series from the Evaluation Center, Western Michigan.

CONTROL GROUP A group which does not receive
the "treatment" (e.g. a service or product) being evaluated.
(The group which does receive it is the experimental group, a
term which is used even though the study may be ex post
facto and not experimental.) The function of the control
group is to determine the extent to which the same effect
occurs without the treatment. If the extent is the same, this
would tend to show that the treatment was not causing
whatever changes were observed in the experimental
group. To perform this function, the control group must be
"matched," i.e., so chosen as to be closely similar (not
identical) to the experimental group. The more carefully
the matching is done (e.g. by using so-called "identical
twins"), the more sure one can be that differences in outcome
are due to the experimental treatment. A great improvement
is achieved if you can randomly assign matched
subjects to the two groups, and arbitrarily designate one as
the experimental and the other as the control group. This is
a "true experiment"; other cases are weaker and include ex
post facto studies. Matching would ideally cover all environmental
variables as well as genetic ones (all variables
except the experimental one(s)), but in practice we match
only on variables which we think are likely to affect the
results significantly, for example, sex, age, schooling.
Matching on specific characteristics (stratifying) is not essential,
it is only efficient: a perfectly good control group
can be set up by using a (much larger) random sample of the
population as the control group (and also for the experimental
or treatment group). The same degree of confidence
in the results can thus be achieved either by comparing
small closely matched groups (experimental and control) or
large entirely randomly selected groups. Of course, if
you're likely to be wrong, or if you're in doubt, about
which variables to match on, the large random sample is a
better bet even though more expensive and slower. It
should be noted that it is sometimes important to run several
"control groups" and that one could then equally well
call them all experimental groups or comparison groups.
The classical control group is the "no treatment" group, but
it's not usually the most relevant to practical decision-making
(see Critical Competitor). Indeed, it's often not
even clear what "no treatment" means: e.g. if you withhold
your treatment from a control group in evaluating psychotherapy,
they create their own, and may change behavior
just because you withheld treatment; they may get
divorced, get religion, change or lose their job, etc. So you
finish up comparing psychotherapy with something else,
usually a mixture of things, not with nothing; not even with
no psychotherapy, only with no psychotherapy of your
particular brand. Hence it's better to have control groups
that get one or several standard alternative treatments than
"leave them to their own devices," a condition into which
the "no treatment" group often degenerates. And in evaluation,
that's exactly where you bring in the critical competitors.
In medicine, that's why the control group gets a
placebo. It is crucial to understanding the logic of control
groups that one realize they only provide a one-way test of
causation. If there is a difference between the dependent
variable(s) as between the two groups (and if the matching
is not at fault) then the experimental treatment has been
shown to have an effect. But if there is no difference, it has
not been shown that the treatment has no effect, only no
greater effect than whatever (mixture of treatments) happened
to the control group. A corollary of this is that the
differential effect size, when there is one, cannot be identified
as the total effect size of the treatment, except in a
situation where the control is an absolute no-treatment
group, more feasible in agriculture than mammalian
research.

CONVERGENCE GROUP (Stufflebeam). A team 
whose task is to develop the best version of a treatment from 
various stakeholder or advocate suggestions. A generaliza- 
tion of the term, to convergence sessions, covers the process 
that should follow the use of parallel (teams of) evaluators, 
viz. the comparison of their written reports and an attempt 
to resolve disagreements. This should be done in the first 
place by the separate teams, with a referee (group) present 
to prevent bullying; it may later be best to use a separate 
convergence (synthesis) group. 

CORRECTION FOR GUESSING In multiple-choice 
exams with n alternatives in each question, the average 
testee would get 1/n of the marks by guessing alone. Thus if 
a student fails to complete such an exam, it has been sug- 
gested that one should add 1/n of the number of un- 
answered questions to his or her score, in order to get a fair 
comparison with the score of a testee that answers all the 
questions by guessing the ones they do not have time to do 
seriously. There are difficulties both with this suggestion 
("applying the correction for guessing") and with not using 
it; the correct procedure will depend on a careful analysis of 
the exact case. Another version of the correction for guess- 
ing involves subtracting the number of answers that one 
would expect to get by guessing from the total score, 
whether the test is completed or not. These two approaches 
give essentially the same (grading or ranking but not scor- 
ing) results, but their effects may interact differently with 
different instructions on the test and different degrees of 
condition in the testees. In general, ethics requires that if 
such corrections will be used, they be pre-explained to 
testees. 
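
The two versions of the correction can be compared on a worked case. The sketch below is illustrative only: the function names and the 40-question, four-alternative exam are hypothetical, and the second function uses the conventional formula-scoring form (rights minus wrongs divided by n - 1), which is an assumption about what the subtraction version amounts to in practice.

```python
def add_for_unanswered(right, unanswered, n):
    """First version: credit 1/n of the unanswered questions."""
    return right + unanswered / n


def subtract_expected_guesses(right, wrong, n):
    """Second version (assumed form): rights minus wrongs / (n - 1)."""
    return right - wrong / (n - 1)


N_ALTERNATIVES = 4  # hypothetical exam: 40 questions, 4 alternatives each

# Testee A answers 28 questions seriously (24 right, 4 wrong), leaves 12 blank.
a_right, a_wrong, a_blank = 24, 4, 12
# Testee B does the same, then guesses the other 12 with average luck:
# 12 / 4 = 3 more right and 9 more wrong.
b_right, b_wrong, b_blank = 27, 13, 0

a_added = add_for_unanswered(a_right, a_blank, N_ALTERNATIVES)
b_added = add_for_unanswered(b_right, b_blank, N_ALTERNATIVES)
a_subtracted = subtract_expected_guesses(a_right, a_wrong, N_ALTERNATIVES)
b_subtracted = subtract_expected_guesses(b_right, b_wrong, N_ALTERNATIVES)

print(a_added, b_added)
print(round(a_subtracted, 2), round(b_subtracted, 2))
```

In this case each version puts the two testees level with one another, but the scores themselves differ between versions (27 against roughly 22.7), which fits the remark that the approaches agree on ranking but not on scoring.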




CORRELATION The relationship of concomitant oc- 
currence or variation. Its relevance to evaluation is (a) as a 
hint that a causal relation exists (showing an effect to be 
present), (b) to establish the validity of an indicator. The 
range is from -1 to +1, with 0 showing random relation- 
ship, 1 showing perfect (100%) correlation (+1) or perfect 
avoidance (-1). 
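
As an illustration of the range from -1 to +1, here is a minimal sketch that computes a correlation coefficient by hand. It assumes the Pearson product-moment coefficient (the entry does not name a particular coefficient), and the function name and data are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


hours = [1, 2, 3, 4]                # hypothetical indicator values
r_up = pearson_r(hours, [2, 4, 6, 8])     # rises in step: perfect correlation
r_down = pearson_r(hours, [8, 6, 4, 2])   # falls in step: perfect avoidance
print(r_up, r_down)
```

A value near 0 would indicate no linear relationship between the two lists; as the entry says, even a high value is only a hint that a causal relation exists.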

COST AVOIDANCE A crucial element in evaluating 
new systems (which includes e.g. new technology or mana- 
gers since they inevitably create new systems or the possi- 
bility of them). One should look for cost avoidance first, cost 
savings on carry-over procedures second, though they are 
all cost savings in a more general sense. A word processor 
avoids the cost of reproofing unchanged material; it reduces 
the (incremental and time) cost of retyping corrected drafts. 

COSTS, COST-ANALYSIS Cost is negative utility. 
Economists define it relativistically as "the maximum val- 
ued opportunity necessarily forsaken," i.e. as opportunity 
cost, but it is usually better for evaluators to use the informal 
("absolute") sense, since clients understand it better, and 
separately consider opportunity costs. One relativistic 
dimension must always be present however, since costs do 
not exist without specifying the person(s) who bear the cost. 
Cost-analysis should thus always yield a matrix, with 
"payers" listed down the first column, and types of cost 
across the top row. It is often useful to distinguish initial (start-up) costs 
from running (maintenance) costs; capital costs from cash 
flow; discounted from raw costs; direct from indirect costs 
or overhead, which includes depreciation, maintenance, 
taxes, some supplies, insurance, some services, repairs, 
etc.; psychological from tangible costs; outlays from oppor- 
tunity costs. The "human capital" or "human resources" 
approach stresses one non-monetary component. "Margi- 
nal analysis" looks at the relative add-on costs, from a given 
cost-level, and is often both more relevant to a decision- 
maker's choices at that basic cost-level, and more easily 
calculated. Cf. Zero-Based Budgeting, Budget. 
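
The payer-by-cost-type matrix can be sketched as a small table. Everything in the example below is hypothetical: the payers, the cost types, and the figures (taken here to be dollar equivalents) are invented purely to show the layout and the two kinds of totals one might read off it.

```python
# Rows are payers, columns are types of cost (hypothetical dollar equivalents).
cost_matrix = {
    "agency":   {"start-up": 5000, "running": 1200, "opportunity": 0},
    "teachers": {"start-up": 0,    "running": 300,  "opportunity": 800},
    "students": {"start-up": 0,    "running": 50,   "opportunity": 400},
}

# Total borne by each payer (sum across a row).
total_by_payer = {
    payer: sum(costs.values()) for payer, costs in cost_matrix.items()
}

# Total of each cost type (sum down a column).
cost_types = ["start-up", "running", "opportunity"]
total_by_type = {
    kind: sum(costs[kind] for costs in cost_matrix.values())
    for kind in cost_types
}

print(total_by_payer)
print(total_by_type)
```

Keeping the rows separate makes it visible when a program that looks cheap to the agency is shifting costs onto other payers.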

COST-BENEFIT OR BENEFIT-COST ANALYSIS 

Cost-benefit analysis goes a step beyond cost-effectiveness 
analysis (see below) and estimates the overall cost and bene- 
fit of each alternative (product or program) in terms of a 
single quantity, usually money. This analysis will provide an 
answer to the question: Is this program or product worth its 
cost? Or: Which of the options has the highest benefit/cost 
ratio? (It is often not possible to do cost-benefit analysis, e.g. 
when ethical, intrinsic, temporal, or aesthetic elements are 
at stake.) 
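
Once costs and benefits have both been reduced to money, the two questions above become simple arithmetic. The sketch below assumes three hypothetical options with invented dollar figures; the names and numbers are not from the text.

```python
# Hypothetical options: (benefit in dollars, cost in dollars).
options = {
    "Program A": (12000, 8000),
    "Program B": (9000, 4500),
    "Program C": (20000, 16000),
}

# Benefit/cost ratio for each option.
ratios = {name: benefit / cost for name, (benefit, cost) in options.items()}

# "Worth its cost" is read here as a ratio above 1.
worth_its_cost = {name: ratio > 1 for name, ratio in ratios.items()}

# The option with the highest benefit/cost ratio.
best = max(ratios, key=ratios.get)

print(ratios)
print(best)
```

Note that the option with the highest ratio (Program B here) is not the one with the largest net benefit in dollars, so the two questions can have different answers.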

COST-EFFECTIVENESS ANALYSIS The purpose of 
this type of analysis is to determine what a program or 
procedure costs, and what it does (effectiveness), the latter 
often being described in terms of qualities (pay-offs) which 
cannot be reduced to money terms, or to any other single 
dimension of pay-off. This procedure does not provide an 
automatic answer to the question: Is this program or prod- 
uct worth its cost? The evaluator will have to weight and 
synthesize the needs data with cost-effectiveness results to 
get an answer, and even that may not give an unequivocal 
result. But it clarifies the choices considerably. 

COST-FEASIBILITY ANALYSIS Determining on a 
Yes/No basis whether something can be afforded (this 
means you can afford the initial and the continuing costs). 

COST-FREE EVALUATION The doctrine that evalua- 
tions should, if properly designed and used, provide a net 
positive return, on the average. They may do this by leading 
either to the elimination of ineffective programs or proce- 
dures, or to an increase in productivity or quality from 
existing resources/levels of effort. The equivalence tables 
between costs and benefits should be set up to match the 
client's values, and accepted by the client, before the evalu- 
ation begins, so as to avoid undue pressure to be cost-free 
by cost-cutting only, instead of by quality-improvement as 
well as cost-cutting (if the latter is requested at all). 

COST PLUS Another basis for calculating budgets on 
contracts is the "cost plus" basis, which allows the con- 
tractor to charge for costs plus a margin of profit; depending 
on how "profit" is defined, this may mean the contractor is 
making less than if the money was in a savings account and 
s/he was on a salary at some other job, or a good deal more. 
Sometimes cost plus contracts, since they usually omit any 
real controls to keep costs down (indeed, sometimes the 
reverse, since the "plus" is often a percentage of the basic 
cost), are not ideal for the taxpayer either. This has 
prompted the introduction of the "cost plus fixed fee" basis, 
where the fee is fixed and not proportional to the size of the 
contract. That's sometimes better, but sometimes, when 
the scope of work is enlarged during the project, by the 
discovery of difficulties or (subtly) by the agency, it 
shrinks the profit below a reasonable level. The profit, after 
all, has to carry the contractor through periods when con- 
tracts happen not to abut perfectly, pay the interest on the 
capital investment and provide some recompense for high 
risk. The justification for cost plus contracts is very clear in 
circumstances where it is difficult to foresee what the costs 
will be and no sane contractor is going to undertake some- 
thing with an unknown cost. Especially if the agency wishes 
to retain the option of changing the conditions that are to be 
met, the hardware that is to be used, etc., say in the light of 
obsolescence of the materials available at the beginning, the 
cost plus percentage contract can make sense. Competitive 
bidding is still possible, after all. 

CREDIBILITY Evaluations often need to be not only 
valid but such that their audiences will believe that they are 
valid (cf. "It is not enough that justice be done, etc."). This 
may require extra care about avoiding (apparent) conflict of 
interest, for example, even if in a particular case it does not 
in fact affect validity. It should not be forgotten that credibil- 
ity is often necessary for the internal audience (the staff) in a 
formative evaluation and not just for the external audi- 
ence(s). Internal credibility is a major reason for using a 
local expert, who knows the jargon, has subject-matter area 
status, understands "the cross we all bear," etc. 

CRITERION The criterion is whatever is to count as the 
"pay-off," e.g. success in college is often the "criterion 
measure" against which we validate a predictive test like a 
college entrance examination. Ability to balance a check- 
book might be one "criterion behavior" against which we 
evaluate a practical math course. (Cf. standard) 

CRITERION-REFERENCED TEST This type of test 
provides information about the individual's (or a group's) 
knowledge or performance on a specific criterion. The test 
scores are thus interpreted by comparison with pre-deter- 
mined performance criteria rather than by comparison with 
a reference group (see Norm-Referenced Test). The merit of 
such tests depends completely on the (educational) signifi- 
cance of the criterion (trivial criterion, trivial test; theory- 
impregnated criterion, theory-dependent test) and on the 
technical soundness of the test. It is not within an amateur's 
or the usual teacher's domain of competence to construct 
such tests for the basic skills or most major curriculum 
objectives, and when they do the results are often unin- 
terpretable because we know neither whether the subject 
understood the question nor whether s/he should be able to 
answer it. It is clear that successful construction of such tests 
is also beyond the capacity or interest of most professionals: 
we still lack one good functional literacy test, let alone four 
or five to choose from. Grading on a course-test is more 
manageable and is the simplest case of a criterion-refer- 
enced test. (Cf. ranking) 

CRITICAL COMPETITORS Critical competitors are 
those entities with which comparisons need to be made 
when a program, product, etc., is being evaluated. The 
critical competitors can be real or hypothetical, e.g. another 
existing text or one we could easily make with scissors and 
paste. They bear on the question whether the best use of the 
money (and other resources) involved is being made, as 
opposed to the pragmatically less interesting question of 
whether it's just being thrown away. You don't just want to 
know whether this $20.00 text is good; you want to know if 
there's a much better one for $20.00, or one that is just as 
good for $10.00. Those others are (two of) the critical com- 
petitors that should figure in the evaluation of the text. So 
should a film (if there is one), lectures, TV, a job or intern- 
ship, etc., where they or an assemblage of them cover 
similar material. Traditional evaluation design has tended 
to use a no-treatment control group for the comparison, 
which is incorrect; "no treatment" is rarely the real option. 
It's either the old treatment or another innovative one, or both, 
or a hybrid, or something no one has so far seen as relevant 
(or perhaps not even put together). These unrecognized or 
"created" critical competitors are often the most valuable 
contributions an evaluator makes and coming up with them 
requires creativity, local knowledge and realism. In eco- 
nomics, concepts like cost are often defined in terms of 
(what amounts to being the) critical competitors, but it is 
assumed that identifying them is easy. Standard critical 
competitors in an evaluation of rug shampoos (for ex- 
ample) are easy to identify (everything called rug 
shampoo) but the non-standard ones are the most impor- 
tant. In this case Consumers Union included a dilute solu- 
tion of Tide, which out-performed and undercut the cost of 
all the shampoos by a country mile. 

CRITICAL INCIDENT TECHNIQUE (Flanagan) This 
approach, tied to the analysis of longitudinal records, at- 
tempts to identify significant events or times in an individu- 
al's life (or an institution's life, etc.) which in some way 
appear to have altered the whole direction of subsequent 
events. It offers a way of identifying the effects of e.g. 
schooling, in circumstances where a full experimental study 
is impossible. It is, of course, fraught with hazards. (Ref. 
John Flanagan, Psychological Bulletin, 1954, pp. 327-358.) 

CROSS-SECTIONAL (study) If you want to get the 
results that a longitudinal study would give you, but you 
can't wait around to do one, then you can use a cross- 
sectional study as a substitute whose validity will depend 
upon certain assumptions about the world. In a cross- 
sectional study, you look at today's first year students and 
today's graduating seniors and infer e.g. that the college 
experience has produced or can be expected to accompany 
the difference between them; in a longitudinal study you 
would look at today's first year students and wait and see 
how they change by the time they become graduating 
seniors. The cross-sectional study substitutes today's grad- 
uating seniors for a population which you cannot inspect 
for another four years, namely the seniors that today's 
freshmen or first year students will become. The assump- 
tions involved are that no significant changes in the demo- 
graphics have occurred since the present seniors formed the 
entering class, and that no significant changes in the college 
have occurred since that time. (For certain inferences, the 
assumptions will be in the other direction in time.) 

CRYPTO-EVALUATIVE TERM A term which appears 
to be purely descriptive, but whose meaning (in the par- 
ticular context) necessarily (definitionally) involves evalu- 
ative concepts, e.g. intelligent, true, deduction, explanation. 
Cf. Value-imbued. 



CULTURE-FAIR/CULTURE-FREE A culture-free test 
avoids bias for or against certain cultures. Depending upon 
how generally culture is defined, and on how the test is 
used, this bias may or may not invalidate the test. Certain 
types of problem-solving tests involving finding food in an 
artificial desert to avoid starvation, for example, are about as 
near to culture-free as makes any sense; but they are a little 
impractical to use. To discover that a test discriminates 
between e.g. races with respect to the numbers who pass a 
given standard, has absolutely no relevance to the question 
of whether the test is culture-fair. If a particular race has 
been oppressed for a sufficiently long time, then its culture 
will not provide adequate support for intellectual exercises 
(or athletic ones, depending upon the type of oppression); it 
will probably not provide the dietary prerequisites for full 
development; and it may not provide the role models that 
stimulate achievement in certain directions. Hence, quite 
apart from any effects on the gene pool, it is to be expected that 
that racial group will perform worse on certain types of 
tests — if it did not, the argument that serious oppression 
has occurred would be weakened. Systematic procedures 
are now used to avoid clear cases of cultural bias in test 
items, but these are poorly understood. Even distinguished 
educators will sometimes point to the occurrence of a term 
like "chandelier" in a reading vocabulary test as a sign of 
cultural bias, on the grounds that oppressed groups are not 
likely to have chandeliers in their houses. Indeed they are 
not, but that's irrelevant; the question is whether the term 
reliably indicates wide reading, and hence whether a suffi- 
cient number of the oppressor group in fact picked up the 
term through labeling an object in the environment rather 
than through wide reading to invalidate that inference. 
That's an empirical question, not an a priori one. A similar 
point comes up in looking at the use of test scores for 
admission selection; validation of a cut-off is properly based 
on prior experience, and may be based on a mainly white 
population. In such a case, the use of the same cutting 
scores for minorities will tend to favor them, as a matter of 
empirical fact (possibly because the later efforts of those 
individuals get less peer/home support than in the white 
population). 

CURRICULUM EVALUATION Curriculum evaluation 
can be treated as a kind of product evaluation, with the 
emphasis on outcome studies of those using the curriculum; 
or it can be approached in terms of content validity. ("Cur- 
riculum" can refer to the content or to the sequencing of 
courses, etc.) A popular fallacy in the area involves the 
supposition that good tests used in a curriculum evaluation 
should match the goals of the curriculum or at least its 
content; on the contrary, if they are to be tests of the cur- 
riculum, they must be independently constructed, by refer- 
ence to the needs of the user population and the general 
domain of the curriculum, without regard to its specific 
content, goals and objectives. Another issue concerns the 
extent to which long-term effects should be the decisive 
ones: since they are usually inaccessible because of time or 
budget considerations, it is often thought that judgments 
about curricula cannot be made reliably. But essentially all 
long-term effects are best predicted by short-term effects, 
which can be measured. And the causal inferences involved 
from temporally remote data, even if we could wait to study 
the long-term situation, are so much less reliable that any 
gains from the long- term study would likely be illusory. 
One of the most serious errors in a great deal of curriculum 
evaluation involves the assumption that curricula are im- 
plemented in much the same way by different teachers, or 
in different schools; even if a quite thorough checklist is 
used to ensure implementation, there is still a great deal of 
slippage in the teaching process. In the more general sense 
of curriculum, which refers to the sequence of courses taken 
by a student, the slippage occurs via the granting of excep- 
tions, the use of less-than-valid challenge exams, the sub- 
stitution of different instructors for others on leave, etc. 
Nevertheless, good curriculum materials and good cur- 
riculum sequences should be evaluated for gross differ- 
ences in their effectiveness and veracity/comprehensive- 
ness/relevance to the needs of the students. The differences 
between good and bad are so large and common that, de- 
spite all the difficulties, very much improved versions and 
choices can result from even rough and ready evaluation of 
content and teachability. Davis identifies the following 
components in curriculum evaluation: determining the 
actual nature of the curriculum (and its support system of 
counselors, other curricula, catalogs, etc.) as compared with 
the official descriptions (e.g. via transcript analysis, cur- 
riculum analysis of class notes); evaluating its academic 
quality; examining procedures for its evaluation and revi- 
sion; assessing student learning; student surveys including 
exit or alumni interviews; faculty surveys; surveys of em- 
ployers and potential employers; reviews by professional 
curriculum experts; comparison with any standards pro- 
vided by relevant professional associations; checking with 
leading schools or colleges to see if they have improve- 
ments/updates that should be considered. Ref. Designing 
and Evaluating Higher Education Curriculum, Lynn Wood 
& Barbara Gross Davis, AAHE, 1978. 

CUTTING SCORE A score which marks the line be- 
tween grades, between mastery and non-mastery, etc. 
Always arbitrary to some degree, it is justifiable in circum- 
stances where a number of such scores will be synthesized 
eventually. But in a final report, only cutting zones make 
sense and the grades should indicate this, e.g. A, A-, AB, 
B+, . . . where the AB indicates a borderline area. Many 
opponents of minimum competency testing complain about 
the arbitrariness of any cut-off point; the response should be 
to use a zone, i.e., three grades instead of two (clearly not 
competent; uncertain competence; clearly competent). 
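
The three-grade zone can be sketched as a simple rule. The zone boundaries below (58 and 62 around a nominal cut-off of 60) and the function name are hypothetical; where to put the edges of the zone is itself a judgment that would need justification in a real case.

```python
def competence_zone(score, lower=58, upper=62):
    """Classify a score using a cutting zone rather than a single cut-off.

    Scores inside [lower, upper] fall in the borderline zone.
    """
    if score < lower:
        return "clearly not competent"
    if score <= upper:
        return "uncertain competence"
    return "clearly competent"


for s in (45, 59, 60, 62, 75):
    print(s, competence_zone(s))
```

A testee scoring 59 and one scoring 61 receive the same grade here, which is the point of using a zone: neither is declared competent or incompetent on the strength of a two-point difference.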

DATA SYNTHESIS The semi-algorithmic semi-judg- 
mental process of producing comprehensible facts from raw 
data via descriptive or inferential statistics, and interpreta- 
tion in terms of concepts, hypotheses or theories. 

DECILE (Stat.) See Percentile. 

DECISION-MAKER It is sometimes important to dis- 
tinguish between making decisions about the truth of vari- 
ous propositions, and making decisions about the disposi- 
tion of (or appropriate action about) something. While the 
scholar automatically falls into the first category, s/he typi- 
cally only serves as a consultant to a decision-maker of the 
second type. Most discussion about decision-makers in the 
evaluation context refers to those with the power to dis- 
pose, not merely with the power to propose or draw con- 
clusions. 

DECISION-ORIENTED RESEARCH See Conclusion- 
Oriented Research. The distinction is essentially invalid in 
good evaluation, since the conclusions are about the right- 
ness of decisions. 

DECISION RULE A link between an evaluation and 
action, e.g. "those with a grade below C must repeat the 
course"; "Hypotheses which are not significant at the .01 
level will be abandoned." (The latter example is a common 
decision rule but logically improper; see Null Hypothesis.) 

DELIVERY SYSTEM The link between a product or 
service and the population that needs or wants it. Important 
to distinguish this in evaluation, because it helps avoid the 
fallacy of supposing that the existence of the need justifies 
the development of something to meet the need. It does so 
only if one can either develop a new (or make use of an 
existing) delivery system that reaches the needy. (A market- 
ing system reaches those with wants, which may or may not 
happen to be needs.) 

DELPHI TECHNIQUE A procedure used in group 
problem solving, involving — for instance — circulating a 
preliminary version of the problem to all participants, call- 
ing for suggested rephrasings (and/or preliminary solu- 
tions). The rephrasings are then circulated for a vote on the 
version that seems most fruitful (and/or the preliminary 
solutions are circulated for rank ordering). When the rank 
orderings have been synthesized, these are circulated for 
another vote. Innumerable variations on this procedure are 
practiced under the title "Delphi Technique," and there is a 
considerable literature on it. It is often done in a way that 
over-constricts the input, hence is ruined before it begins. 
In any case, the intellect of the organizer must be the equal 
of the participants or the best suggestions won't be recog- 
nized as such. A phone conference call may be more effec- 
tive, faster and cheaper, perhaps with one chance at written 
after-thoughts. But a good Delphi is worthwhile. 

DEMOGRAPHICS The characteristics of a population 
defined in terms of its entry characteristics, going into a 
testing program, for example: age, sex, level of education, 
occupation, place of birth, residence, etc., by contrast with 
the results, e.g. IQ, attitude, scores. 

DEPENDENT VARIABLE One which represents the 
outcome; the contrast is with the independent variables, 
which are the ones we (or nature) can manipulate directly. 
That definition is circular and so are all others; the distinc- 
tion between dependent and independent variables is an 
ultimate notion in science, definable only in terms of other 
such notions, e.g. randomness, causation. 

DESCRIPTION Often the hardest part of an evaluation 
for beginners, because they think it's easy: labeling is 
enough! Describing normally includes evaluation language 
("You can't miss it, it's the near-perfect 1975 Porsche 911 
near the end of the row"). In evaluation, it's usually desir- 
able to separate the description from the evaluation, and 
like screening out theory-impregnated language in describ- 
ing a physics experiment, this turns out to be hard. Making 
the description complete enough for replication or for check- 
ing implementation of the treatment is hard. Description 
may also include describing the true function of something, 
which sometimes requires deep analysis. (Describing DNA 
would normally refer to its function; it is rarely restricted to 
what can be directly observed.) Keeping the description 
concise (often important) by restricting it to salient fea- 
tures requires a needs assessment done on the audiences for 
the evaluation, as does the choice of language level. De- 
scriptions provided by the client should be treated as claims 
for verification, not premises. Stakeholders normally pro- 
vide multiple inconsistent descriptions, all of which are 
usually wrong (or too vague to be acceptable). The delivery 
system and support system should usually be included 
in the total description. 

DESCRIPTIVE STATISTICS The part of statistics con- 
cerned with providing illuminating perspectives on or re- 
ductions of a mass of data (cf. inferential statistics); typi- 
cally this can be done as a translation, involving no risk. For 
example, calculating the mean score of a class from its 
individual scores is straight deduction and no probability is 
involved. But estimating the mean score of the class from the 
actual mean of a random sample of the class is of course 
inferential statistics. 
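
The contrast between the two calculations can be made concrete. In the sketch below the class scores are invented, the sample size of five is arbitrary, and the standard error is the usual simple estimate (sample standard deviation over the square root of the sample size) with no finite-population correction; all of these are assumptions made for illustration.

```python
import random
import statistics
from math import sqrt

# Hypothetical scores for a class of fifteen.
class_scores = [55, 62, 68, 70, 71, 74, 75, 78, 80, 82, 85, 88, 90, 93, 97]

# Descriptive: the class mean is deduced from the scores; no risk involved.
class_mean = statistics.mean(class_scores)

# Inferential: estimate the class mean from a random sample of five.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = rng.sample(class_scores, 5)
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / sqrt(len(sample))

print("class mean (exact):", round(class_mean, 2))
print("sample estimate:", round(sample_mean, 2), "+/-", round(standard_error, 2))
```

The first figure is a straight reduction of the data; the second is an estimate that carries a margin of uncertainty and would come out differently with a different sample.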

DESIGN (of evaluation; see Evaluation Design) 

DIFFUSION The process of spreading information 
about (typically) a product (cf. dissemination with which 
diffusion is deliberately and somewhat artificially con- 
trasted). 

DIMENSIONAL EVALUATION A species of analyti- 
cal evaluation in which the meritorious performance is bro- 
ken out into a set of dimensions that have useful statistical 
properties (e.g. independence) or are familiar from other 
contexts and easily grasped, etc. Useful for explaining the 
meaning of an evaluation report; component evaluation is 
more useful for explaining the cause of an evaluated per- 
formance. Cf. Component Evaluation. 

DISCREPANCY EVALUATION (Provus) Evaluation 
conceived of as identifying the gaps between time-tied ob- 
jectives and actual performance, on the dimensions of the 
objectives. An elaboration of the simple goal- 
achievement model of evaluation; a good basis for monitor- 
ing. 

DISPERSION (Stat.) The extent to which a distribu- 
tion is "spread" across the range of its variables, as opposed 
to where it is "centered," the latter being described by 
measures of "central tendency," e.g. mean, median, mode. 
Dispersion is measured in terms of e.g. standard deviation 
or semi-interquartile range. 
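
The difference between centering and spread shows up clearly in two sets of scores with the same mean. The data below are invented, and the quartiles are taken with the default method of Python's `statistics.quantiles` (other quartile conventions give slightly different values); the helper name is hypothetical.

```python
import statistics


def semi_interquartile_range(data):
    """Half the distance between the first and third quartiles."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    return (q3 - q1) / 2


narrow = [48, 49, 50, 51, 52]   # hypothetical scores, tightly bunched
wide = [10, 30, 50, 70, 90]     # same mean, widely spread

for name, data in (("narrow", narrow), ("wide", wide)):
    print(
        name,
        "mean:", statistics.mean(data),
        "sd:", round(statistics.stdev(data), 2),
        "semi-IQR:", semi_interquartile_range(data),
    )
```

Both measures of central tendency agree that the two sets are centered at 50; only the dispersion measures distinguish them.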

DISSEMINATION The process of distributing (typical- 
ly) a product itself, rather than information about it (cf. 
diffusion). Also used as a jargon synonym for distribution. 

DISSONANCE See Consonance. 

DOMAIN-REFERENCED TESTING The purpose of 
testing is not usually to determine the testee's ability to 
answer the questions on the test, but to provide a basis for 
conclusions about the testee's ability with regard to a much 
wider domain. Criterion-referenced tests identify ability to 
perform at a certain (criterion) level on (typically) a par- 
ticular dimension, e.g. two-digit multiplication. DRT is a 
slight generalization of that to cover cases like social studies 
education where it seems misleading to suggest that there is 
a criterion. One can think of a domain as defined by a large 
set of criteria, from which we sample, just as (at the other 
end) the test samples from the testee's abilities. The major 
problem with DRT is defining domains in a useful way. W. J. 
Popham has a usefully specific discussion in his Educa- 
tional Evaluation, Prentice-Hall, 1975. 

DUMPING The practice of unloading funds rapidly 
near the end of the fiscal year in order that they will not be 
returned to the central bureaucracy, which would be taken 
as a sign that next year's budget could be reduced by that 
amount since it wasn't needed. This may be done with all 
the trappings of an RFP, i.e., via a contract, but it's a situa- 
tion where the difference between a contract and a grant 
tends to evaporate since the contract is so unspecific (be- 
cause of lack of time for writing the RFP carefully) that it has 
essentially the status of a grant. From the agency's point of 
view, dumping is a sign of inadequate staff size, not a lack of 
need for the work that is RFP'd (as Congress often infers). 

ECHELON A term like "cohort," sometimes used in- 
terchangeably with the latter, but better restricted to a 
group (or group of groups) that is time-staggered with 
regard to its entry. If a new group comes on board every 
four weeks for five months, followed by a three month gap, 
while they are being trained, and then the whole process 
begins again, the first three groups are called the first eche- 
lon; each of them is a cohort. 

EDUCATIONAL ROLE (of the evaluator) It is both 
empirically and normatively the case that this role is of the 
greatest importance, at worst second only to the truthfind- 
ing role. This is not merely because few people have been 
properly educated as to either the importance or the tech- 
niques of evaluation; it is because the discipline will prob- 
ably always seem unimportant until it (or its neglect) bites 
you, and quick education about that particular branch or 
application of evaluation will then become very important. 
No professional who is unsophisticated about personnel, 
product, proposal and program evaluation in their field is a 
professional; but even when (or if) this sophistication is 
widespread, application of it to oneself and one's own pro- 
grams will not be easy, and the evaluator can help to teach 
one how to handle the process and its results. When 
Socrates said, "The unexamined life is not worth living," he 
was identifying himself as an evaluator; but it is not acciden- 
tal that he is best-known as a teacher. Nor is it accidental 
that he was killed for combining the two roles. See also 
Valuephobia. 

EDUCATIONAL (OR OVERALL) SIGNIFICANCE To 
get this, the evaluator must examine the data corresponding 
to each of the prior checkpoints on the Key Evaluation 
Checklist: educational significance represents a total syn- 
thesis of all you know. In particular, the gains attributed to 
the program or product being evaluated must be education- 
ally significant/valuable and not just be statistically signif- 
icant, something which may only be the result of using a 
large sample, or due to irrelevant vocabulary gains, poor 
test construction, peculiar statistical analysis or some other 
insignificant variable. (The same applies for medically sig- 
nificant, socially significant, etc.) 
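
How a large sample alone can produce statistical significance is easy to show numerically. The sketch below assumes a two-group comparison of means with equal group sizes, a known common standard deviation, and a two-sided test at the .05 level (critical z of 1.96); the half-point gain and the other figures are hypothetical.

```python
from math import sqrt


def z_for_mean_difference(gain, sd, n_per_group):
    """z statistic for a difference between two group means.

    Assumes equal group sizes and a known, common standard deviation.
    """
    return gain / (sd * sqrt(2 / n_per_group))


GAIN = 0.5        # hypothetical gain: half a point on the test
SD = 10.0         # hypothetical standard deviation of scores
CRITICAL_Z = 1.96  # two-sided test at the .05 level

z_small = z_for_mean_difference(GAIN, SD, n_per_group=50)
z_large = z_for_mean_difference(GAIN, SD, n_per_group=20000)
effect_size = GAIN / SD  # the same in both cases

print("n = 50 per group:    z =", round(z_small, 2))
print("n = 20000 per group: z =", round(z_large, 2))
print("standardized effect size:", effect_size)
```

The gain is the same twentieth of a standard deviation in both runs; only the sample size has changed, yet the second result clears the significance threshold. Whether half a point matters educationally is a separate question that the statistics do not answer.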

EFFECTIVENESS Usually refers to goal-achievement. 

Various indexes of effectiveness were developed around 
mid-century, when evaluation was thought of as simply 
goal-achievement measurement for social action programs. 
Success is a cognomen, merit and worth are the more im- 
portant evaluation predicates. 

EFFICIENCY Efficiency implies the absence of wastage 
for a given output; it can be increased by increasing output 
for a given input. It is perhaps more of a micro notion than 
effectiveness, i.e. one can infer the function of components 
without needing to know the goals of the project, to which 
effectiveness more obviously relates. 

EFFORT, LEVEL OF A measure used in RFPs and evaluation as an
index of resource input, hence important in evaluation of
e.g. efficiency.



EIR See Environmental Impact Report.

EMPIRICISM The epistemological doctrine that stresses the
primacy of sensory knowledge. In the philosophy of science,
the (logical) positivists were amongst the most prominent and
extreme empiricists. Russell described himself as a "logical
empiricist." The contrast is with the (allegedly) evil tribe
of metaphysicians who tend to be "idealists" (in the
technical sense of believing that the mind, not the senses,
is the primary basis of knowledge). The diluted version of
empiricism that became the dominant ideology of the social
sciences essentially stressed the
superiority of experiments and (public) observations over a
priori reasoning and introspection. But it also involved a
key holdover prejudice of the positivists, namely the
rejection of all evaluative terms. While it is plausible to
say they do not refer to observable properties, it was wrong
to conclude that they lacked objective reference, since there
is another legitimate category of terms, namely the so-called
theoretical terms, i.e. terms referring to unobservable
entities/processes/states whose existence can be inferred
from (i.e., explains) the observable phenomena. The
positivists were originally keen to eliminate most
unobservables as metaphysical entities, but this
unfortunately leaves one without either extrapolable
taxonomies, adequate explanations, or guides for future
microexplorations. The empiricist social scientists quickly
drifted into acceptance of at least middle-level theories and
their concepts, but forgot to check whether the taboo terms
of the evaluative vocabulary were thereby legitimated. (Of
course, they were using these terms all the time, in
distinguishing good experimental designs from bad ones, good
instruments from bad, etc.) The best excuse is perhaps that
evaluative language consists of "theoretical terms" that
relate to functions not observations, but then so does much
of the language of mathematics and linguistics. The
inexcusable behavior was the failure to reconcile the
obviously legitimate use of methodological evaluation and
consumer product evaluation with the continued support for
the doctrine of value-free social science. See Valuephobia.

ENEMIES LIST Worst enemies often make best critics. They have
two advantages over friends, in that they are more motivated
to prove you wrong, and more experienced with a radically
different viewpoint. Hence they will often probe deep enough
to uncover assumptions one has not noticed, and destroy
complacence about the impregnability of one's inferential
structures. Obviously we should use them for metaevaluation,
and pay them well. But who enjoys working with, thanking, and
paying their enemies? The answer is: A good evaluator. This
is a key test of the "evaluation attitude" (see Evaluation
Skills). How little we really care about the correct
assessment of merit and how much we prefer to make life easy
for ourselves shows up nowhere more clearly than on this
issue. A good example is
the distribution of teaching-evaluation forms to students in
a college class, normally done near the end of the semester.
But where are your enemies then? Long gone; only the
self-selected remain. You should distribute the forms to
every warm body that crosses the threshold on the first day
and any later date; to be turned in to their seat-neighbor
when they decide not to come back. It is the ones who left
who can tell you the most; by now you know most of what the
stalwarts will say. If you value quality, reach out for
suggestions to those who think you lack it.

ENGINEERING MODEL See Medical Model.

ENJOYMENT Although it is an error in educational evaluation
to treat enjoyment as primary and learning as not worth
direct inspection, there's no justification for not counting
enjoyment at all. (Kohlberg once commented on the big early
childhood program evaluations that it was too bad no one
bothered to check whether at least the kids cried less in
Headstart centers than at home.) And the situation in certain
cases, e.g. aesthetic education, is much nearer to one where
enjoyment is a primary goal. A common fallacy is to argue
that since it would be a serious mistake to teach K-3
children some cognitive skills at the expense of making them
hate school, we should therefore make sure they enjoy school
and try to teach them skills. That prioritization of effort
reduces the already meager interest in teaching something
valuable, and has never been validated for gains in positive
attitude towards school. The teacher is in conflict of
interest here, since finger-painting takes less preparation
than spatial skill-building.

ENTHUSIASM EFFECT See Hawthorne Effect. 

ENVIRONMENTAL IMPACT REPORT (EIR) Often required by law prior
to granting building or business permits or variance. A form
of evaluation focusing on the ecosystem effects. Currently
based mainly on bio-science and/or traffic analysis, these
tend to be thin on the evaluation of opportunity costs,
indirect costs, ethics, contingency trees, etc.

ERRORS OF MEASUREMENT It is a truism that 
measurement involves some error; it is more interesting to
notice exactly how these errors can get one into trouble in
evaluation studies. For example, it is obvious that if we
select the low scorers on a test for remedial work, then some
of these will be in the group because of errors of
measurement (i.e. their performance on the particular test
items that were used does not give an accurate picture of
their ability, i.e. they were unlucky, or distracted). It
follows that an immediate remeasurement, using a test of
matched difficulty, would probably place them somewhat
higher. Hence on a posttest, which is essentially such a
retesting, they will come out looking better, even if the
intervening treatment lacked all merit. This is simply a
statistical artefact due to errors of measurement
(specifically, a regression effect). It also follows that
matching two groups on their entry level skills, where we
plan to use one of them as the control in a
quasi-experimental study (i.e. a study where the two groups
are not created by random assignment) will get us into
trouble because the errors of measurement on the two groups
cannot be assumed to be the same, and hence the regression
effect will be different in size. Another nasty effect of
errors of measurement is to reduce correlation coefficients;
one may intuitively feel that if the errors of measurement
are relatively random, they should "average out" when one
comes to look at the correlations, but the fact is that the
larger the errors of measurement, the smaller the
correlations will appear. See Regression to the Mean.
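
Both effects can be seen in a small simulation. This is a sketch under stated assumptions, not a procedure from the text: each observed score is taken to be a fixed "true ability" plus independent normal error, no treatment of any kind is applied between the two testings, and the names `simulate` and `pearson` and all the numbers are hypothetical.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate(n=20000, error_sd=10.0, seed=1):
    """Return (apparent gain of the low scorers, pretest-posttest correlation)."""
    rng = random.Random(seed)
    ability = [rng.gauss(100, 15) for _ in range(n)]
    # Two testings of the same unchanged ability, each with fresh error.
    pretest = [a + rng.gauss(0, error_sd) for a in ability]
    posttest = [a + rng.gauss(0, error_sd) for a in ability]
    # Select the bottom fifth on the pretest "for remedial work".
    cutoff = sorted(pretest)[n // 5]
    low = [i for i in range(n) if pretest[i] <= cutoff]
    gain = mean(posttest[i] for i in low) - mean(pretest[i] for i in low)
    return gain, pearson(pretest, posttest)

if __name__ == "__main__":
    for sd in (5.0, 10.0, 20.0):
        gain, r = simulate(error_sd=sd)
        print(f"error sd {sd}: apparent gain {gain:.1f}, correlation {r:.2f}")
```

The selected group "improves" although nothing was done to it, and the correlation between the two testings shrinks as the error grows, even though the underlying abilities are identical on both occasions.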

ESCROW A neutral individual or secure place where identifying
data can be deposited until completion of an evaluation
and/or destruction. (Term originated in the law.) See Filter,
Anonymity.

ETHICS (in evaluation) See also Responsibility Evaluation.
Ethics is the ultimate normative social science, ultimate
because it refers to duties (etc.) which transcend all other
obligations such as those to prudence, science, and
professionalism. It is in one sense a branch of evaluation,
in another a discipline which, like history or statistics,
contributes a key element to many evaluations. That it is
(logically) a social science is of course denied by virtually
all social scientists, who have valuephobia about even the
suggestion that non-ethical value-judgments have a place in
science and hypervaluephobia about importing ethical
judgments. But the inexorable consequence of the development
of game- and decision-theory, latent function analysis,
democratic theory in political science, welfare economics,
analytical jurisprudence, behavioral genetics, and the "good
reasons" approach to ethical theory, is that all the bricks
have been baked for the building, and it's just superstitious
to argue that some mysterious force prohibits putting one on
top of another. The Constitution and Bill of Rights are
essentially ethical propositions, with two properties: first,
there are good reasons for adopting them; second, they
generate sound laws. The arguments for them (e.g. Mill's "On
Liberty") are as good social science as you'll find in a long
day's walk through the professional journals, and the
inferences to specific laws are well-tested. It follows that
all the well-known arguments for law and order are indirectly
arguments for the (secular) ethics of the Constitution and
for the axiom of equal rights from which that flows, just as
the arguments for the existence of atoms are, indirectly,
arguments for the existence of electrons. Ethics is just a
general social strategy and no more immune to criticism by
social science than the death penalty or excise taxes or
behavior therapy or police strikes. To act as if some logical
barrier prevents science from arguing for or against
particular ethical claims such as the immorality of the death
penalty (a question of overall social strategy), but not from
arguing for or against particular strategies within economics
or penology, is to cut the social sciences off from the most
important area in which they can make a social contribution.
And it leads to ragged edges on and inconsistencies within
the sciences themselves. For an excellent discussion of the
"ethics-or-else" dilemma for allocation theory, see E.J.
Mishan, Cost-Benefit Analysis, 1976, Praeger, Chapter 58,
"The Social Rationale of Welfare Economics." Interestingly
enough, although a large part of that book is about
evaluation (e.g. Chapter 61 is called "Consistency in Project
Evaluation"), neither that term nor the author's
frequently-used variation "valuation" gets into the index.
See Valuephobia.

EVALUABILITY Projects and programs, and the plans for them,
are beginning to be scrutinized quite carefully for
evaluability. This might be thought of as the first
commandment of accountability or as a refinement of Popper's
requirement of falsifiability. The underlying principle
can be expressed in several ways, e.g. "It is not enough that
good works be done, it must be possible to tell that (and,
more importantly, when) good works have been done." Or "You
can't learn by trial and error if there's no clear way to
identify the errors." The bare requirement of an evaluation
component in a proposal has been around for a while; what's
new is a more serious effort to make it feasible and
appropriate. That presupposes more expertise in evaluation
than most review panels and project monitors have; but that
may come. Evaluability should be checked and improved at the
planning and preformative stages. Requiring evaluability of
new programs is analogous to requiring serviceability in a
new car; obvious enough, but who besides fleet owners (and
GSA) knew that there was for many years a 2:1 difference in
standard service costs as between Ford and GM? Congress may
some day learn that low evaluability has a high price.

EVALUAND Whatever is being evaluated; if it is a per- 
son, the term "evaluee" is more appropriate. 

EVALUATION The process of determining the merit or 
worth or value of something; or the product of that process. 
The special features of evaluation, as a particular kind of 
investigation (distinguished e.g. from traditional empirical 
research in the social sciences), include a characteristic
concern with cost, comparisons, needs, ethics, and its own
political, ethical, presentational, and cost dimensions; and
with the supporting and making of sound value judgments,
rather than hypothesis-testing. The term is sometimes used
more narrowly (as is "science") to mean only systematic and
objective evaluation, or only the work of people labeled
"evaluators." While evaluation in the broad sense is
inescapable for rational behavior or thought, professional
evaluation is frequently worthless and expensive. Evaluation,
properly done, can be said to be "a science" in a loose
sense, as can, for example, teaching; but it is also an art,
an inter-personal skill, something that judges and juries and
literary critics and real estate assessors and jewelry
appraisers do, and thus not "one of the sciences." See also
Formative/Summative, Analytical/Holistic, etc.

EVALUATION EDUCATION Consumer education is 
still rather weak on training in evaluation, which should be
its most important component. And of course there are other
contexts than those in which one's role is that of the
consumer, where evaluation education would be most valuable,
notably the manager role, or the service-provider/
professional role. Few teachers, for example, have the
faintest idea how to evaluate their own work, although this
is surely the minimum requirement of professionalism. The
last decades have seen considerable federal and state effort
to provide reasonable standards of quality that will protect
the consumer in a number of areas; they have not yet really
understood that the superimposition of standards is a poor
substitute for understanding the justification for them.
Evaluation training is the training of (mainly professional)
evaluators; evaluation education is the training of the
citizenry in evaluation techniques, traps, and
resource-finding, and is the only satisfactory long-run
approach to improving the quality of our lives without
extraordinary wastage of resources.

EVALUATION ETHICS AND ETIQUETTE Because evaluation in
practice so often involves tricky interpersonal relations it
has much to learn from diplomacy, arbitration, mediation,
negotiating, and management (especially personnel
management). Unfortunately, the wisdom of these areas is
poorly encapsulated into learning and training materials,
which are mainly truistic or anecdotal. The correct approach
would appear to be via the refinement of normative principles
and the collateral development of extensive calibration
examples, rather as in developing skill in applied ethical
analysis (casuistry). An example: you are the only
first-timer on a site-visit team to a prestigious
institution, and you gradually realize, as the time slips
away in socializing and reading or listening to reports from
administrators and administration-selected faculty, that no
serious evaluation is going to occur unless you do something
about it. What should you do? There is a precise
(flow-chartable) solution which specifies a sequence of
actions and utterances, each contingent upon the particular
outcome of the previous act, and which avoids unethical
behavior while minimizing distress; mature professionals
without evaluation experience never get it right; some very
experienced
and thoughtful evaluators come very close; a group containing
both reaches complete consensus on it after a twenty-minute
discussion. Like so much in evaluation, this shows it meets
the standards of common-sense though it is not in our
individual repertoires. It should be. Another example: a
write-in response on an anonymous personnel evaluation form
accuses the evaluee of sexual harassment. As the person in
charge of the evaluation, what exactly should you do?
("Ignore it" is not only ethically wrong, it is obviously
impossible.)



EVALUATION OF EVALUATIONS See Meta-evaluation.

EVALUATION OF EVALUATORS Track record, not 
publications, is the key, but how do you get it? See
Evaluation Registry, Big Shops.

EVALUATION PREDICATES The distinctively evalu- 
ative relations or ascriptions involved in grading, ranking, 
scoring, and apportioning, as ways to determine worth, 
merit or other value. A huge list of other evaluative terms
is appropriate to some contexts rather than others, e.g.
validity (of tests or news stories), integrity (of security
systems or personnel), adequacy, appropriateness,
effectiveness, plausibility. These predicates can be
construed as serving an information-compression function in
language,
combining performance data and needs assessment or 
standards data into a concise package, of which the letter 
grade for coursework is the paradigmatic example. 

EVALUATION REGISTRY A concept half-way to the 
certification or licensing of evaluators from complete 
laissez-faire. This would operate by encouraging evaluators 
and their clients to file a copy of their joint contract or letter 
of agreement with the evaluation registry at the beginning 
of an evaluation; to this would be appended any modifica- 
tions made along the way and finally a brief standard report 
by each party, made independently, assessing the quality 
and utility of the evaluation, and the performance of the 
client. Each would have a chance to add a brief reaction to 
the other's evaluation, and the net end result (2 pages) 
would then be available for inspection, for a fee, by poten- 
tial clients. This arrangement, it is argued, would be of more 
use to the client than asking an evaluator to suggest former
clients as references or simply looking at a list of
publications or reports, but would avoid the key problems
with licensing: enforcement, standards, and funding. Start-up
costs for such a registry, although small, are not available,
possibly because we are in a period of evaluation backlash.
See Directory of Evaluation Consultants (The Foundation
Center, 1981).

EVALUATION RESEARCH Evaluation done in a serious scientific
way; the term is popular amongst supporters of the social
science model of evaluation. See Introduction to this book.

EVALUATION-SPECIFIC METHODOLOGY Much of the methodology used
in evaluation studies is derived from other disciplines; the
special nature of evaluation is the way in which it selects
and synthesizes these contributions into an appropriate
overall perspective, and brings them to bear on the various
kinds of evaluation tasks. But there are some situations
where substantial variations on the usual procedures in
scientific research become appropriate. Two instances will be
mentioned. In survey research, sample size is normally
predetermined in the light of statistical considerations and
prior evidence about population parameters. In evaluation,
although there are occasions when a survey of the classical
kind is appropriate, surveys are frequently investigatory
rather than descriptive surveys and then the situation is
rather different. Suppose that a respondent in a phone
interview evaluation survey of users of a particular service
comes up with a wholly unexpected comment on the service
which suggests, let us say, improper behavior by the
service-providers. (It might equally well suggest an
unexpected and highly beneficial side effect.) This
respondent is the thirtieth interviewee, from a planned
sample of a hundred. On the standard survey pattern, one
would continue, using the same interview form, through the
rest of the sample. In evaluations, one will quite often want
to alter the form so as to include an explicit question on
this point. Of course, one can no longer report the results
of the survey with a sample of a hundred, with respect to
this question (and any others with which its presence might
interact). But one may very well be able to turn up another
twenty people that respond under cueing, who would not
have produced this as a free response. That result is much
more important than salvaging the survey, in most cases. It
also points to another feature of the evaluation situation,
namely the desirability of time-sequencing the interviews or
questionnaire responses. Hence one should try to avoid using
a single mass mailing, a common practice in survey research;
by using sequential mailing one can examine the responses for
possible modifications of the instrument. The second taboo
that we may have good reason to break concerns sample size.
If we find ourselves getting a very highly standard kind of
response to a fairly elaborate questionnaire, we are
discovering that the population has less variability than we
had expected, and we should alter our estimate of an
appropriate sample size in mid-stream. No point in continuing
to fish in the same waters if you don't get a bite after an
hour. The generalization of this point is to the use of
"emergent", "cascading", or "rolling" designs, where the
whole design is varied en route as appropriate. (These terms
come from the glossary in Evaluation Standards.) Other
evaluation-specific methodology includes the use of parallel
teams working independently, calibration of judges,
convergence sessions, "blind" judge synthesis, the avoidance
of data-gathering unless the data are necessary for
replication or are indicators of merit, bias balancing etc.
See also Anonymity, Questionnaires.
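
The mid-stream revision of sample size can be sketched as follows. The entry gives no formula, so this uses the common normal-approximation rule for estimating a mean to within a chosen margin, n = (z * sd / margin)^2, as an assumption; the function names, the rating scale and the numbers are all hypothetical.

```python
import math
from statistics import stdev

def required_n(sd, margin, z=1.96):
    """Sample size needed to estimate a mean to within +/- margin.

    Uses the normal approximation n = (z * sd / margin) ** 2, rounded up;
    z = 1.96 corresponds to roughly 95% confidence.
    """
    return math.ceil((z * sd / margin) ** 2)

def revised_n(responses_so_far, margin, z=1.96):
    """Re-estimate the sample size from the variability actually observed."""
    return required_n(stdev(responses_so_far), margin, z)

if __name__ == "__main__":
    # Planning stage: a 1-to-5 rating item, guessed standard deviation 1.0.
    print("planned sample:", required_n(sd=1.0, margin=0.25))
    # After the first returns, the answers turn out to be highly standard.
    early_returns = [3, 3, 3, 4, 3, 3, 2, 3, 3, 3]
    print("revised sample:", revised_n(early_returns, margin=0.25))
```

A handful of early returns gives only a rough estimate of the variability, so in practice one would re-check the estimate as further responses arrive rather than fix the revised figure once and for all.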

EVALUATION SKILLS There are lists of desirable skills for
evaluators (Stufflebeam has one with 234 competencies); as
for philosophers, almost any kind of specialized knowledge is
advantageous, and the more obvious tool skills alone (see the
Key Evaluation Checklist) are far more demanding than in any
other discipline: statistics, cost-analysis, ethical
analysis, management, teaching, therapy, contract law,
graphics, synthesis, dissemination (for the report); and of
course there are the evaluation-specific techniques. Here we
mention a couple that are less obvious. First, the evaluative
attitude or temperament. Unless you are committed to the
search for quality, as the best of those in e.g. the legal or
scientific professions are committed to the search for
justice or the search for truth, you are in the wrong game.
You will be too easily tempted by the charms of "joining"
(e.g. joining the program staff; see Going Native); too
unhappy with the outsider's role. The virtue of
evaluation must be its own real reward, for the slings and
arrows are very real. (Incidentally, this value is a
learnable and probably even a teachable characteristic for
many people, but some people come by it naturally and others
will never acquire it.) The second package of relatively
unproclaimed skills are "practical logical analysis" skills,
e.g. identifying hidden agendas or unnoticed assumptions
about the dissemination process; or mismatches between a
goal-statement and a needs-statement; or loopholes in an
evaluation design; the ability to provide accurate summaries
one fiftieth of the length of the original (precis) or to
give a totally non-evaluative, non-interpretive description
of a program or treatment. The good news is that no-one is
good at all the relevant skills; that there is room for
specialists, and also for team members. Partly because of the
formidable nature of the relevant skills list, evaluation is
a field where teams, if properly employed, are immensely
better than soloists. Not only are two heads better than one,
six heads, if carefully chosen and appropriately instructed,
are better than five.

EVALUATION STANDARDS A set of principles for the guidance of
evaluators and their clients. The major effort is the
Evaluation Standards (ed. D. Stufflebeam, McGraw Hill, 1980),
but the Evaluation Research Society has also produced a set.
There are some shared weaknesses (for example, neither
includes needs assessment) but the former is much more
explicit about interpretation, giving specific examples of
applications etc. In general, these are likely to do good by
raising clients' consciousness and general performance, but
fears have been expressed by first-rank evaluators that they
may rigidify approaches, stifle research, increase costs (cf.
"defensive lab tests" in medical practice today), and give a
false impression of sophistication. See also Bias.

EVALUATION, THEORY OF The theory of evaluation includes a
wide range of topics from the logic of evaluative discourse,
general accounts of the nature of evaluation and how it can
be justified (axiology), through socio-political theories of
its role in particular types of environment, to so-called
"models" which are often simply conceptualizations of or
procedural recommendations for evaluation.
Little work is funded on this; a notable exception is NIE's
Research on Evaluation project at NWL, a series of studies on
radically different "metaphors" for evaluation.

EVALUATION TRAINING There is essentially no serious support
for this at the moment, despite the large demand (and larger
need) for trained evaluators, perhaps a sign of evaluation
backlash. The best places are probably CIRCE and the
Evaluation Center at Western Michigan, with post-doc work at
the Northwest Lab. Short courses are more widely available
and advertised in Evaluation News. See also Training of
Evaluators.

EVALUATIVE ATTITUDE See Evaluation Skills.

EVALUEE A person being evaluated; the more general term,
which covers products and programs, etc., is "evaluand."

EXECUTIVE SUMMARY Abstract of results from an evaluation,
usually in non-technical language.

EXIT INTERVIEWS Interviews with subjects as they leave e.g. a
training program or clinic, to obtain factual and judgmental
data. A very good time for these, with respect to course or
teaching evaluation in the school or college setting, is at
the time of graduation, when (a) the student will have some
perspective on most of the educational experience; (b) fear
of retribution is low; (c) response rate can be nearly 100%
with careful planning; (d) judgments of effects are
relatively uncomplicated e.g. by work-experience as an extra
causal factor; (e) memory is still fresh. Later than this
(alumni surveys), conditions can and do deteriorate, though
there is a partial offset because job-relevance can be judged
more accurately.

EXPERIMENT See True Experiment. 

EXPERIMENTAL GROUP The group (or single per- 
son, etc.) that is receiving the treatment being studied. Cf. 
control group. 

EXPLANATION 1. By contrast with evaluation, which identifies
the value of something, explanation involves answering a Why
or How question about it, or other type of request for
understanding. Often explanation involves
finding the cause of a phenomenon, rather than its effects
(which is a major part of evaluation). Whenever possible,
without jeopardizing the main goals of what may be holistic
summative evaluation, a good evaluation design tries to
uncover micro-explanations (e.g. by identifying those
components of the curriculum package which are producing the
major part of the effects, or those which are having little
effect). The first priority, however, is to resolve the
evaluation issues (Is the package the best available? etc.).
Too often the research orientation and training of evaluators
leads them to do a poor job on evaluation because they got
interested in explanation (LE). Even then, an explanation in
terms of a conceptual scheme may not involve much of a
diversion, whereas we can ill afford a search for a theory.
The realization that the logical nature and investigatory
demands of evaluation are quite different from those of
explanation is as important as the corresponding realization
with respect to prediction and explanation, which the
neo-positivist philosophers of science still think are
logically the same under the (temporal) skin.

2. Explanation of an evaluation is something else; it may
involve (a) translating technicalities, (b) unpacking the
[illegible] indicators about the evidence, i.e., justifying;
(c) exhibiting the micro-evaluations on separate dimensions
that added up to the global rating; (d) giving component
evaluations.

EX POST FACTO DESIGN One where we identify a control group
"after the fact," i.e., after the treatment has occurred. A
very much weaker design than the true experiment since there
must have been something different about the subjects that
got the treatment without being assigned to it, in order to
explain why they got it, and that something means they're not
the same as the control group, in some unknown respect that
may be related to the treatment.
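
The weakness can be made concrete with a simulation. This is a sketch under stated assumptions, not an analysis from the text: a single unmeasured trait, here called motivation, is assumed to drive both the decision to take the treatment and the outcome, while the treatment itself has no effect at all. The name `compare_designs` and all the numbers are hypothetical.

```python
import random
from statistics import mean

def compare_designs(n=20000, seed=7):
    """Return (ex post facto gap, randomized gap) for a zero-effect treatment."""
    rng = random.Random(seed)
    motivation = [rng.gauss(0, 1) for _ in range(n)]
    # The outcome depends on motivation only; the treatment adds nothing.
    outcome = [m + rng.gauss(0, 1) for m in motivation]
    # Ex post facto: the more motivated subjects tended to seek the treatment.
    self_selected = [m + rng.gauss(0, 1) > 0 for m in motivation]
    # True experiment: a coin flip decides who is treated.
    randomized = [rng.random() < 0.5 for _ in range(n)]

    def gap(treated_flags):
        treated = [o for o, t in zip(outcome, treated_flags) if t]
        control = [o for o, t in zip(outcome, treated_flags) if not t]
        return mean(treated) - mean(control)

    return gap(self_selected), gap(randomized)

if __name__ == "__main__":
    ex_post_gap, randomized_gap = compare_designs()
    print(f"ex post facto comparison: {ex_post_gap:.2f}")
    print(f"randomized comparison:    {randomized_gap:.2f}")
```

The after-the-fact comparison credits the treatment with a difference that belongs entirely to whatever led subjects to obtain it; random assignment removes that difference.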

EXTERNAL (evaluator or evaluation) An external evaluator is
someone who is at least not on the project or program regular
staff, or someone, in the case of personnel evaluation, other
than the individual being evaluated, or their staff. It is
better if they are not even paid by the project or by any
entity with a prior preference for the success or failure of
the project. Where or to whom the
external evaluator reports is what determines whether the
evaluation is formative or summative, either of which may be
done by external or by internal evaluators (contrary to the
common view that external is for summative, internal for
formative), and both of which should be done by both.

EXTERNAL VALIDITY By contrast with internal 
validity, this refers to the generalizability of the experi- 
mental/evaluation findings. Here the traps to avoid include 
failure to identify key environmental variables that happen 
to be constant throughout the experiment, decreased sensi- 
tivity of participants to treatment at posttest due to pretest, 
reactive effects of experimental arrangement, or biased 
selection of participants that might affect the
generalizability of the treatment's effect to
non-participants, thus jeopardizing the external validity.
(Ref. Experimental and Quasi-Experimental Designs for
Research, D.T. Campbell and J.C. Stanley, Rand McNally & Co.,
Chicago, 1972, and Validity Issues in Evaluative Research,
ed. Bernstein (Sage, 1976).) The references discuss the
classical conception of validity
in evaluation, but this is only part of the problem. Content 
validity is extremely important in evaluation and essen- 
tially not discussed in these (typical) references. See Gener- 
alizability. 

EXTRAPOLATE Infer conclusions about ranges of the 
variables beyond those measured. Cf. Interpolate.

FACE VALIDITY The apparent validity, typically of 
test items or of tests; there can be skilled and unskilled 
judgments of face validity, and highly skilled judgments 
which come pretty close to content validity, which does 
require systematic substantiation. 

FADING Technique used in programmed texts, where a first
answer is given completely, the next one in part with gaps,
then with just a single cue, then called for without help. A
key technique in training and calibrating evaluators.

FAULT TREE ANALYSIS (CAUSE TREE ANALYSIS) These terms emerged
about 1965, originally in the literature of management
science and sociology. They are sometimes used in a highly
technical sense, but are useful in a straightforward sense.
Basically, the model to which they refer is the
trouble-shooting chart, often to be found in the pages of
e.g. a Volkswagen manual. The branches in the tree identify
possible causes of the fault (hence the terms "cause" and
"fault" in the phrase), and this method of representation,
with various refinements, is used as a device for management
consultants, for management training, etc. Its main use in
evaluation is as a basis for needs assessment.
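
In the straightforward sense, a trouble-shooting chart of this kind is simply a tree whose root is the fault and whose leaves are candidate causes. A minimal sketch of that representation follows; the nested-dictionary layout, the function name `leaf_causes` and the example entries are all hypothetical illustrations, not taken from any manual.

```python
def leaf_causes(tree, path=()):
    """Yield every root-to-leaf path; each leaf is a candidate root cause."""
    for label, subtree in tree.items():
        here = path + (label,)
        if subtree:
            # An intermediate fault: keep descending.
            yield from leaf_causes(subtree, here)
        else:
            # Nothing beneath this node, so it is a possible cause.
            yield here

# Hypothetical example entries in the style of a car trouble-shooting chart.
FAULT_TREE = {
    "engine will not start": {
        "no spark": {"fouled plugs": {}, "dead battery": {}},
        "no fuel": {"empty tank": {}, "blocked fuel line": {}},
    }
}

if __name__ == "__main__":
    for chain in leaf_causes(FAULT_TREE):
        print(" <- ".join(reversed(chain)))
```

For needs assessment the same walk would run over a tree whose root is an observed shortfall and whose leaves are the candidate unmet needs.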

FIELD INITIATED This refers to proposals or projects 
for the funding of grants or contracts that originate from 
workers in the field of study, rather than from a program 
announcement of the availability of funds by an agency for 
work in a certain area (which is known as "solicited"
research or development).

FIELD TRIAL (OR FIELD TEST) A dry run of a test 
of a product/program, etc. Absolutely mandatory in any
serious evaluation or development activity. It is essential
that at least one true field trial should be done in circumstances
and with a population that matches the targeted
situation and population. Earlier ("hothouse") trials may 
not meet this standard, for convenience reasons, but the last 
one must. Unless run by external evaluators (very rare), 
there is a major risk of bias in the sample or conditions or
content or interpretations used by the developer in the final 
field trials. 

FILTER Someone who — or a computer which — 
removes identifying information from evaluative input, to 
preserve the anonymity of the respondent.

FISCAL EVALUATION The highly developed subfield
that involves looking at the worth or probable worth of
e.g. investments, programs, companies. See ROI, Payback, 
Time Discounting, Profit, etc. 

FISHING Colloquialism for exploratory (phase of) re- 
search; or for true nature of large slices of serious (e.g. 
program) evaluation; or for visits to Washington in search of
funding support.

FLOW CHART A graphic representation of the sequence 
of decisions, including contingent decisions, that is set up to 
guide the management of projects (or the design of computer
programs), including evaluation projects. Usually looks
like a sideways organization diagram, being a series of
boxes and triangles ("activity blocks," etc.) connected by
lines and symbols that indicate simultaneous or sequential
activities/decision points, etc. A PERT chart is a special
case.

FOCUS (of a program) A more appropriate concept for 
most evaluation than "goal"; both are theoretical concepts, 
both serve to limit complaints about things not done to the 
general area where resources are available/legitimately use- 
able. The focus of a program is often improved by good 
evaluation. 

FORMATIVE EVALUATION Formative evaluation is 
conducted during the development or improvement of a 
program or product (or person, etc.). It is an evaluation 
which is conducted for the in-house staff of the program and
normally remains in-house; but it may be done by an internal 
or an external evaluator or (preferably) a combination. The 
distinction between formative and summative has been 
well summed up in a sentence of Bob Stake's: "When the
cook tastes the soup, that's formative; when the guests taste
the soup, that's summative." Typically, formative evaluation
benefits from analytic evaluation, but holistic evaluation
plus trial-and-error or expert advice will also work and
may sometimes be all that is possible. Analytic evaluation,
in turn, may or may not involve/require/produce causal
analysis, so the connection between evaluation and causation
is pretty remote, contrary to W. Edwards Deming's
oft-quoted remark, "Evaluation is a study of causes."

FOUND DATA Data that already exists, prior to the
evaluation— contrast is with experimental data or test and 
measurement data. 

FUGITIVE DOCUMENT One which is not published 
through the public channels as a book or journal article. 
Evaluation reports have often been of this kind. ERIC (Edu- 
cational Resources Information Center) has picked up some 
of these, but since its standards for selection are so variable
and its selection so limited, time spent in searching it is all
too often not cost-effective. 

FUNDING (of evaluations) Done in many ways, but 
the most common patterns are described here. The evaluation
proposal may be "field-initiated," i.e. unsolicited, or
sent in response to (a) a program announcement, (b) an RFP
(Request for Proposal), (c) a direct request. Typically (a)
results in a grant, (b) in a contract; the former identifies a
general charge or mission (e.g. "to develop improved tests 
for early childhood affective dimensions") and the latter 
specifies more or less exactly what is to be done, e.g. how 
many cycles of field tests (and who is to be sampled, how 
large a sample is to be used, etc.), in a "Scope of Work." The 
legal difference is that the latter is enforceable for lack of 
performance, the former is (practically) not. But it scarcely
makes sense to use contracts for research (since you usually 
can't foresee which way it will go), and it is rarely justifiable 
to use them for the very specific program evaluations re- 
quired by law. Approach c, "sole-sourcing," eliminates 
competitive bidding and can usually only be justified when 
only one contractor has much the best combination of relevant
expertise or equipment or staff resources; but it is much
faster, and it does avoid the common absurdity of 40 bid- 
ders, each spending 12K ($12,000) to write a proposal worth 
300K to the winner. The wastage there (180K) comes out of 
overhead costs which are eventually paid by the taxpayer, 
or by bidders going broke because of foolish requirements. 
A good compromise is the two-tier system, all bidders submitting
a two (or five or ten) page preliminary proposal, the
best few then getting a small grant to develop a full pro- 
posal. Contracts may or may not have to be awarded to the 
lowest "qualified" bidder; qualification may involve financial
resources, stability, prior performance, etc., as well as
technical and management expertise. On big contracts there 
is usually a "bidders' conference" shortly after publication 
of the RFP (it's often required that federal agencies publish 
the RFP in the Commerce Business Daily and/or the Federal
Register). Such a conference officially serves to clarify the 
RFP; it may in fact be a cross between a con job and a poker 
game. If you ask clever questions, others may (a) be scared 
off, (b) steal your approach, etc. The agency may be sniffing 
around for a "friendly" evaluator and the evaluators may be 
trying to look friendly but not so friendly as to reduce 
credibility, etc. Eventually, perhaps after a second bidders'
conference, the most promising bidders will be asked for
their Best and Final bid and on this basis the agency selects
one, probably using a possibly anonymous external review
panel to lend credibility to the selection. After the first
conference between the winner and the project officer (the
agency's representative used to be called the monitor), it
often turns out that the agency wants or can be persuaded to 
want something done that isn't clearly in the contract; the 
price will then be renegotiated. Or if the price was too low 
(the RFP will often specify it in terms of "Level of Effort" as 
N "person-years" of work; this may mean N x 30K or 
N x 50K in dollar terms, depending on whether overhead 
is an add-on) to get this job done, the contractor may just go
ahead till they run out of money and then ask for more, on 
the grounds the agency will have sunk so much in and be so 
irreversibly committed (time-wise) that they have to come 
through to "save their investment." The contractor of 
course loses credibility on later bids but that's better than 
bankruptcy; and the track-records are so badly kept that no
one may hold it against them (if indeed they should). In the
bad old days, low bids were a facade and renegotiation on
trumped-up grounds would often lead to a cost well above
that of another and better bidder. Since evaluations are 
tricky to do in many ways, bidders have to allow a pad in 
their budget for contingencies, or just cross their fingers,
which quickly leads to bankruptcy. Hence another option is 
to RFP for the best design and per diem and then let the 
contract for as long as it takes to do it. The form of abuse 
associated with this cost-plus approach is that the contrac- 
tor is motivated to string it out. So no overall clear saving is 
attached to either approach; but the latter is still used where 
the agency wants to be able to change targets as preliminary 
results come in, a sensible point, and where it has good 
monitoring staff to prevent excessive over-runs (from estimates
which of course are not binding). A major weakness in
all of these approaches is that innovative proposals will 
often fail because the agency has appointed a review panel 
of people committed to the traditional approaches who 
naturally tend to fund "one of their own." Another major
weakness is the complexity of all this, which means that big 
organizations who can afford to open branches in D.C., pay 
professional proposal-writers and "liaison staff" (i.e., lob- 
byists), have a tremendous edge (but often do poor work, 
since most of the best people do not work for them). A third
key weakness is that the system described favors the product
of timely paper rather than the solution of problems,
since that's all the monitoring and managing process can 
identify. Billions of dollars, millions of jobs, thousands of
lives are wasted because we have no reward system for
really good work that produces really important solutions.
The reward is for the proposal, not the product; and the 
reward is the contract. Once obtained, only unreliability in 
delivery or gross negligence jeopardizes future awards. You 
can see the value system this arrangement produces from 
the way the vice-presidents all move on to work on the next 
"presentation" as soon as negotiation is complete. It would 
only cost pennies to reverse this procedure via (partial)
contingency awards and expert panels to review work done 
instead of proposals. 
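
The bidding arithmetic in this entry can be checked with a short sketch. The function names are hypothetical, and reading the 180K wastage as total proposal spending less the value of the contract is an assumption, though it agrees with the figures given (40 x 12K - 300K). The person-year count of 3 is likewise only an illustration.

```python
def bidding_totals(n_bidders, proposal_cost, contract_value):
    """Return (total spent on proposals, excess of that spending over the contract value)."""
    total_proposal_cost = n_bidders * proposal_cost
    return total_proposal_cost, total_proposal_cost - contract_value


def level_of_effort_cost(person_years, rate_per_person_year):
    """Dollar equivalent of a Level of Effort stated as N person-years."""
    return person_years * rate_per_person_year


# Figures from the entry: 40 bidders, 12K per proposal, a 300K contract.
total, excess = bidding_totals(n_bidders=40, proposal_cost=12_000, contract_value=300_000)
print(total, excess)

# N person-years priced at 30K or 50K, depending on whether overhead is an add-on.
print(level_of_effort_cost(3, 30_000), level_of_effort_cost(3, 50_000))
```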

FUTURISM Since many evaluands are designed to 
serve future populations and not (just) present ones, much 
evaluation requires estimating future needs and performance.
The simpler aspect of this task involves extrapolation
of demographic data; even this is poorly done, e.g. the
crunch on higher education enrolments was only foreseen 
by one analyst (Cartter) although the inference was simple 
enough. The harder task is predicting e.g. vocational pat- 
terns twenty years ahead. Here one must fall back on 
possibility-covering techniques, rather than probability-selection,
e.g. by teaching flexibility of attitude or generalizable
skills.

GENERALIZABILITY (Cf. External Validity) Although
external validity is commonly equated with generalizability,
it refers to only part of it. Typically one wants to
generalize to populations (etc.) essentially other than the one
tested, not just extrapolably other; and it's not just population
differences but treatment differences and effect differences
that are of interest. (See the generalizability checkpoint in
the Key Evaluation Checklist.) In short, the generalizability
of external validity is akin to that of inferential statistics; we
might call it short-range generalization. But science and evaluation
are constantly pushing for long-range generalization,
involving tenuous inductive or imaginative leaps that require
investigation. Generalization is thus often nearer to
speculation than to extrapolation. And the value of things/
people is often crucially affected by their versatility, i.e.
utility across generalization.




GENERAL-PURPOSE EVALUATOR Someone with a 
wide range of evaluation expertise, not identified with a 
particular field/area/discipline. (The contrast is with a local
expert.) It helps to have, and the term may connote, experience
outside a type of evaluation, too, e.g. outside program
evaluation, perhaps in policy/product/personnel, or outside
the accreditation type of evaluation. The GP's weakness
is a lack of local knowledge, but this trades off against
a lack of local biases. The best arrangement is the use of two 
(or more) evaluators, one local and one general purpose. 
See Shared Bias. 



GOAL The technical sense of this term restricts its use
to rather general descriptions of intended outcome; more 
specific descriptions are referred to as objectives. It is im- 
portant to realize that goals cannot be regarded as observ- 
able features of programs (or products, services or systems). 
There is often an announced, official or original goal—but 
usually several, in which case the problem of how to weight 
their relative importance comes up and is rarely answered 
with enough precision to pin down the correct evaluative 
conclusion. Usually, too, different stakeholders have diffe- 
rent goals— conscious and unconscious— for a program; 
and most of these change with time. The best one can hope 
for is to get a general sense of a program's goals as theoreti- 
cal constructs from the paper trail, interviews and actions. 
Beginners often think the goals are the instigator's official 
original goals, just as they think the program's correct description
comes from the same source. Both are hard to get;
since the goal-hunt is not only difficult but often unnecessary
and also often biasing, one should be very careful to
avoid it whenever possible. See Goal-Free Evaluation.

GOAL-ACHIEVEMENT MODEL (of evaluation) The 
idea that the merit of the program (or person) is to be 
equated with success in achieving a stated goal. This is the 
most naive version of goal-based evaluation. 

GLOBAL (evaluation) See Holistic.

GOAL-BASED EVALUATION (GBE) This type of
evaluation is based and focused on knowledge of the goals
and objectives of the program, person or product. A goal-based
evaluation often does not question the merit of goals;
often does not look at the cost side of cost-effectiveness;
often fails to search for or locate the appropriate critical 
competitors; often does not search systematically for side 
effects or valid process parameters such as ethicality; in 
short, often does not include a number of important and
necessary components of an evaluation. Even if it does
include these components, they are referenced to the program's
(or to personal) goals and hence the approach is
likely to be involved in serious problems such as identifying 
these goals, handling inconsistencies in them and changes 
in them over time, dealing with shortfall and overrun out- 
comes, and avoiding the perceptual bias of knowing about 
them. GBE is not just goal-achievement evaluation; it can,
in principle, involve criticism of the goals. But such criticism 
would presumably involve a needs assessment of the impacted
population, and once one has that, direct comparison
of it with the total effects of the program yields an
estimate of worth without reference to the goals, i.e., GFE.
So GBE is either too narrow or unnecessary, except as a
convenient reconceptualization of an evaluation for the
benefit of planners or managers who understandably must
use a goal framework and would like feedback about the
program's effectiveness in those terms. GBE is manager-oriented
evaluation, close to monitoring and far from
consumer-oriented evaluation. (See GFE.) Defining evaluation
as the study of the effectiveness or success of programs
is a sign of (often unconscious) acceptance of GBE. 

GOAL-FREE EVALUATION (GFE) In this type of
evaluation, the evaluator is not told the purpose of the 
program but enters into the evaluation with the purpose of 
finding out what the program actually is doing without
being cued as to what it is trying to do. If the program is 
achieving its stated goals and objectives, then these 
achievements should show up (in observation of process 
and interviews with consumers (not staff)); if not, it is ar- 
gued, they are irrelevant. Merit is determined by relating 
program effects to the relevant needs of the impacted popula- 
tion, rather than to the program (i.e., agency or citizenry or 
congressional or manager's) goals. It could thus be called
"needs-based evaluation" or "consumer-oriented evaluation"
by contrast with goal-based or manager-oriented
evaluation. It does not substitute the evaluator's goals nor
the goals of the consumer for the program's goals; the
evaluation must justify (via the needs assessment) all assignments
of merit. GFE is generally disliked by both
managers/administrators and evaluators, for fairly obvious 
reasons: it raises anxiety by its lack of predeterminate struc- 
ture. It is said by supporters to be less intrusive than GBE, 
more adaptable to mid-stream goal shifts, better at finding 
side effects and less prone to social, perceptual and cogni- 
tive bias. It is risky, because the client may get a nasty shock 
when the report comes in (no prior hand-holding) and 
refuse to pay because embarrassed at the prospect of having 
to pass the evaluation along to the funding agency. (But if 
the findings are invalid, the client should simply document 
this and ask for modifications.) GFE is reversible, a key 
advantage over GBE; hence an evaluation design should 
(sometimes) begin GFE, write a preliminary report, then go 
to GBE to see if serious errors of omission occurred. (Run- 
ning a parallel GFE effort along with a GBE reduces the 
time-span.) The shock reaction to GFE in the area of program
evaluation (it is the standard procedure used by all
consumers evaluating products) suggests that the grip of
management bias on program evaluation was very strong, 
and possibly that managers felt they had achieved consider- 
able control over the outcomes of GBEs. GFE is analogous to 
double-blind design in medical research; even if the eval- 
uator would like to give a favorable report (e.g. because of 
being paid by the program, or because hoping for future 
work from them) it is not (generally) easy to tell how to
"cheat" under GFE conditions. The risk of failure by an 
evaluator is of course greater in GFEs, which is desirable 
since it increases effort, identifies incompetence, and im- 
proves the balance of power. 

GOING NATIVE The fate of evaluators that get co-opted
by the programs they are evaluating. (Term originated
with the Experimental Schools Program evaluation in
mid-60's.) The co-option was often entirely by choice and 
well illustrates the pressures on, temptations for, and hence 
the temperamental requirements for being a good evaluator.
It can be a very lonely role and if you start thinking
about it in the wrong way you start seeing yourself as a
negative force; and who wouldn't rather be a co-author
than a (mere) critic? One answer: someone who cares more
about quality than kudos. See Evaluation Skills. In anthropological
field research, a closely related phenomenon
is known as the "my tribe" syndrome, characterized by 
proprietary attitudes towards the subjects of one's field 
work and defensive attitudes towards one's conclusions 
about them. See Independence. 

GRADE-EQUIVALENT SCORE A well-meant attempt 
to generate a meaningful index from the results of stan- 
dardized testing. If a child has a 7.4 grade-equivalent score, 
that means s/he is scoring at the average level (estimated to 
be) achieved by students four months into the 7th grade. 
Use of the concept has often led to an unjustified worship of
average-scores as a reasonable standard for individuals, and 
to overlooking the raw scores which may tell a very different 
story. Suppose a beginning eighth grader is scoring at the 
7.4 level; parents may be quite upset unless someone points 
out that on this particular test the 8.0 level is the same as the 
7.4 level (because of summer backsliding). In reading, a 
deficit of two whole grade equivalents is quite often made 
up in a few months in junior high school if a teacher suc- 
ceeds in motivating the student for the first time. Again, a 
student may be a whole grade-equivalent down and be 
ahead of most of the class— if the average score is calculated 
as the mean not the median. Again, a student in the fifth 
grade scoring 7.2 might flunk the seventh grade reading test
completely; 7.2 just means that s/he scores where a seventh
grader would score on the fifth grade test. A year's deficit
from the 5th grade norm isn't comparable to a year's deficit
from the 4th grade norm. And so on, i.e., use with
caution. But don't throw it out unless you have something
better for audiences not made up of statisticians.
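
The mean-versus-median point can be illustrated with a small sketch. The class below is invented for the purpose (scores are in tenths of a grade, so 50 stands for 5.0), and the function name is hypothetical. A few very high scorers pull the mean up, so a student can sit a full grade-equivalent below the mean and still outscore most classmates.

```python
from statistics import mean, median


def standing(student_score, classmate_scores):
    """Compare one student with the class mean, the class median, and the classmates outscored."""
    everyone = list(classmate_scores) + [student_score]
    return {
        "deficit_from_mean": mean(everyone) - student_score,
        "deficit_from_median": median(everyone) - student_score,
        "classmates_outscored": sum(s < student_score for s in classmate_scores),
    }


# Hypothetical class: six low scorers and three very high scorers.
classmates = [40, 41, 42, 43, 44, 45, 95, 100, 100]
print(standing(50, classmates))
```

A negative deficit from the median means the student is above the median, which is the situation the entry describes.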

GRADING ("Rating" is sometimes used as a synonym.)
Allocating individuals to an ordered (usually small)
set of labeled categories, the order corresponding to merit,
e.g. A-F for "letter grading." Those within a category are
regarded as tied if the letter grade only is used; but if a
numerical grade ("scoring") is also used, they may be
ranked within grades. The use of plus and minus grades 
simply amounts to using more categories. Grading provides
a partial ranking, but ranking cannot provide grading
without a further assumption, e.g. that the best student is 
good enough for an A, or that "grading on the curve" is 
justified. That is, the grade labels normally have some inde- 
pendent meaning from the vocabulary of merit ("excel- 
lent" etc.) and cannot be treated as simply a sequenced set 
of categories separated by making arbitrary cuts in a ranked 
sequence of individuals. In short, the grades are normally 
criterion-referenced; it is ranking that is facilitated by 
norm-referenced testing: that distinction frequently results 
in confusion. For example, grading of students does not 
imply the necessity for "beating" other students, does not
need to engender "destructive competitiveness" as is often
thought. Only publicized grading on a curve does that. Pass/
Not Pass is a simple form of grading, not a no-grading 
system. Grades should be treated as quality estimates by an 
expert and thus constitute essential feedback to the learner 
or consumer; corrupting that feedback because the external 
society misuses the grades is abrogation of duty to the 
learner or consumer, a confusion of validity with utiliza- 
tion. See Responsibility Evaluation. 
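
The difference between grading against standards and grading on the curve can be sketched as follows. The cutoffs, quotas, names and scores are all invented for illustration, and both function names are hypothetical. With fixed standards every student in a strong class can earn an A; a curve hands out the same labels by rank alone, whatever the absolute quality.

```python
def criterion_grades(scores, cutoffs):
    """Grade each score against fixed standards. cutoffs: (minimum score, label), highest first."""
    grades = {}
    for name, score in scores.items():
        # Take the first cutoff the score meets; "F" if it meets none.
        grades[name] = next((label for minimum, label in cutoffs if score >= minimum), "F")
    return grades


def curve_grades(scores, quotas):
    """Grade by rank alone. quotas: (label, number of students receiving it), best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    labels = [label for label, count in quotas for _ in range(count)]
    return dict(zip(ranked, labels))


scores = {"Ann": 95, "Bob": 92, "Cy": 91}
print(criterion_grades(scores, [(90, "A"), (80, "B"), (70, "C")]))
print(curve_grades(scores, [("A", 1), ("B", 1), ("C", 1)]))
```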

GRANT See Funding. 

HALO EFFECT The tendency of someone's reaction to 
part or all of a stimulus (e.g. a test, a student's answers to a 
test, someone's personality) to spill over into their reaction 
to other, especially adjacent, parts of the same stimulus. For 
example, judges of exams involving several essay answers 
will tend to grade the second answer by a particular student 
higher if they graded the first one high than they would if 
this had been the first answer they had read by this student 
(the error is often as much as a full grade). Halo effect is 
avoided by having judges assess all the first components 
before they look at any of the second components, and by 
concealing from them their grade on the first component 
when they come to evaluate the second one. The halo effect 
gets its name from the tendency to suppose that someone 
who is saintly in one kind of situation must be saintly (and 
perhaps also clever) in all kinds of situations. But the halo
effect also refers to the illicit transfer of a negative assessment.
The Hartshorne & May work (Studies in Deceit, Columbia,
1928) suggests there is no good basis for this transfer
across categories of immorality.
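
The judging sequence recommended above (all first answers before any second answers) can be laid out mechanically. This is only a sketch under stated assumptions: the function name is hypothetical, and reshuffling the students for each question is an added precaution that the entry does not itself require. Concealing earlier grades from the judge remains a separate procedural step.

```python
import random


def grading_order(students, n_questions, seed=0):
    """List (question, student) pairs so that every answer to one question
    is graded before any answer to the next question."""
    rng = random.Random(seed)
    order = []
    for question in range(1, n_questions + 1):
        shuffled = list(students)
        rng.shuffle(shuffled)  # fresh student order for each question
        order.extend((question, student) for student in shuffled)
    return order


for question, student in grading_order(["s1", "s2", "s3"], n_questions=2):
    print(question, student)
```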

HARD vs. SOFT (approaches to evaluation) Colloquial 
way to refer to the difference between the quantitative/
testing / measurement / survey / experimental-design approach
to evaluation and the descriptive / observational /
narrative / interview / ethnographic / participant-observer
kind of approach. 

HAWTHORNE EFFECT The tendency of a group or
person being investigated, or experimented on, or evaluated,
to react positively or negatively to the fact that they are
being investigated/evaluated, and hence to perform better
(or worse) than they would in the absence of the investigation,
thereby making it difficult to identify any effects due to
the treatment itself. Not really the same as the placebo
effect, i.e., the effect on the consumer of an enthusiastic
service-provider that results simply from the evident belief
in treatment power of provider or recipient, though the
term is often used to cover both. The Hawthorne effect
can occur without any belief in the merit of the treatment
(and was originally defined in a situation where the results
did so). See John Henry Effect.

HEADROOM See Ceiling Effect. 

HIERARCHICAL SYSTEM See Two-Tier. 

HOLISTIC SCORING/GRADING/EVALUATING The 
allocation of a single score/grade/evaluation to the overall 
character or performance of an evaluand; by contrast with 
analytical scoring/grading/evaluating. The holistic/analytical
distinction corresponds to the macro/micro distinction
in economics and the molar/molecular or gestalt/atomistic
distinction in psychology. Global is another cognomen,
if (as in the health field) "holistic" has other, confusing
associations.

HYPERCOGNITIVE or TRANSCOGNITIVE The do- 
main beyond the supercognitive, which is the stratosphere 
of the cognitive; includes meditation and concentration 
skills; originality; the intellectual dimension of empathic 
insight (as evidenced in role-playing, acting, etc.); eidetic 
imaging; near-perfect objectivity, rationality, reasonableness,
or "judgment" in the common parlance; ethical sensitivity;
ESP skills, etc. Some of this is incorrectly referred to
as "affective education"; part of it belongs there; all of it 
deserves more attention. 

HYPOTHESIS TESTING The standard model of scien- 
tific research in the classical approach to the social sciences, 
in which a hypothesis is formulated prior to the design of
the experiment, the design is arranged so as to test its truth, 
and the results come out in terms of a probability estimate 
that the results were solely due to chance ("the null hypoth- 
esis"). If the probability is extremely low that only chance 
was at work, the design should make it inductively highly 
likely that the hypothesis being tested was correct. What is 
to count as the high degree of improbability that only 
chance was at work is usually taken to be either the .05
"level of significance" (one chance in 20) or the .01 (one
chance in a hundred) "level of significance." When dealing 
with phenomena whose existence is in doubt, a more ap- 
propriate level is .001; where the occurrence of the phenom- 
enon in this particular situation is all that is at stake, the 
conventional levels are more appropriate. The significance 
level is thus used as a crude index of the merit of a hypoth- 
esis, but is legitimate as such only to the extent that the 
design is bulletproof. Since evaluation is not hypothesis 
testing, little of this is of concern in evaluation, except in 
checking subsidiary hypotheses e.g. that the treatment 
caused certain outcomes. 
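
As a minimal sketch of the significance levels mentioned above, the following computes the probability of a result at least as extreme as the one observed if only chance were at work, for the simplest case of repeated yes/no trials. The function names and the example of 15 successes in 20 trials are illustrative assumptions, not drawn from the text.

```python
from math import comb


def chance_probability(n_trials, n_successes, p_chance=0.5):
    """Probability of at least n_successes in n_trials if only chance
    (success rate p_chance on each trial) is at work."""
    return sum(
        comb(n_trials, k) * p_chance**k * (1 - p_chance) ** (n_trials - k)
        for k in range(n_successes, n_trials + 1)
    )


def levels_met(p_value, levels=(0.05, 0.01, 0.001)):
    """Return the conventional significance levels that p_value falls below."""
    return [level for level in levels if p_value < level]


p = chance_probability(20, 15)
print(p, levels_met(p))
```

Here the result clears the .05 level but not the .01 or .001 levels, so whether it counts as significant depends on which level the situation calls for.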

An important distinction in hypothesis testing that carries
over to the evaluation context in a useful way is the
distinction between Type 1 and Type 2 errors. A Type 1 
error is involved when we conclude that the null hypothesis 
is false although it isn't; a Type 2 error is involved when we 
conclude that the null hypothesis is true when in fact it's 
false. Using a .05 significance level means that in about 5%
of the cases studied in which the null hypothesis is true, we
will make a Type 1 error. As we
tighten up on our level of significance, we reduce the chance
of Type 1 error, but correspondingly increase the chance of a 
Type 2 error (and vice versa). It is a key part of evaluation to 
look carefully at the relative costs of Type 1 and Type 2 
errors. (In evaluation, of course, the conclusion is about 
merit rather than truth.) A metaevaluation should carefully 
spell out the costs of the two kinds of error, and scrutinize 
the evaluation for its failure or success in taking account of
these in the analysis, synthesis, and recording phases. For
example, in quality control procedures in drug manufacture
(a type of evaluation), it may be fatal to a prospective user to
identify a drug sample as satisfactory when in fact it is not;
on the other hand, identifying it as unsatisfactory when it is
really satisfactory will only cost the manufacturer whatever
that sample costs the manufacturer to make. Hence it is
obviously in the interest of the public and the manufacturer
(given the possibility of damage suits) to set up a system
which minimizes the chance of false acceptances, even at
the expense of a rather high level of false rejections. Because
of the totally non-mnemonic characteristics of the terms
"Type 1" and "Type 2," it's always better to use terms like
"incorrect acceptance" and "incorrect rejection" and make
the referent the evaluand, rather than the null hypothesis, the
latter concept being likely to prove unenlightening to most
audiences.
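
The trade-off between the two kinds of error, and the role of their relative costs, can be sketched numerically. The model is an assumption made for illustration: a one-sided test on a normally distributed statistic, with the true effect expressed in standard errors. The function names, the costs, and the even prior are all hypothetical.

```python
from statistics import NormalDist


def type2_rate(alpha, effect_size):
    """Chance of a Type 2 error for a one-sided z-test at significance level alpha,
    when the true effect is effect_size standard errors."""
    cutoff = NormalDist().inv_cdf(1 - alpha)
    return NormalDist(mu=effect_size).cdf(cutoff)


def expected_error_cost(alpha, effect_size, cost_type1, cost_type2, prior_null=0.5):
    """Expected cost of errors, weighting each error rate by its cost and by how
    often the null hypothesis is assumed to be true (prior_null)."""
    beta = type2_rate(alpha, effect_size)
    return prior_null * alpha * cost_type1 + (1 - prior_null) * beta * cost_type2


# Tightening the significance level lowers Type 1 errors but raises Type 2 errors.
for alpha in (0.05, 0.01, 0.001):
    print(alpha, type2_rate(alpha, effect_size=2.0))

# When a Type 2 error is far more costly, the looser level is the cheaper one.
print(expected_error_cost(0.05, 2.0, cost_type1=1, cost_type2=100))
print(expected_error_cost(0.01, 2.0, cost_type1=1, cost_type2=100))
```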

ILLUMINATIVE EVALUATION (Parlett and Hamilton) 
A type of pure process evaluation, very heavy on multi-perspective
description and interpersonal relations, very
light on justified tough standards, very easy on valuephobes
and very well defended in Beyond the Numbers
Game (MacMillan 1977). Congenial to responsive evaluation
supporters, not unlike perspectival evaluation except
more relativistic.

IMPACT EVALUATION An evaluation focussed on 
outcomes or pay-off rather than process delivery or implementation
evaluation.

IMPACTED POPULATION The population that is 
crucial in evaluation, by contrast with the target population
and even the true consumers. See Recoil Effects.

IMPLEMENTATION EVALUATION Recent reactions 
to the generally unexciting results of impact evaluations on 
social action programs have included a shift to mere monitoring
of program delivery, i.e. implementation evaluation.
You can easily implement; it's harder to improve.

IMPLEMENTATION OF EVALUATIONS The frequent
complaint (by evaluators) that evaluations have little
effect, i.e. are not implemented, refers to four quite different
situations: (a) Many evaluations are simply incompetent
and it's most desirable they not be implemented;
(b) Some evaluations make (and should make) no immediate
recommendations (e.g. accountability evaluations);
nevertheless they have a powerful preventive effect and
some cumulative long-run effect, but neither is readily
measurable; (c) Many evaluations are commissioned in such
a way that even when done as well as possible they will not
be of any use because they were set up so as to be irrelevant
to the real issues that affect the decision-maker, or are so
under-funded that no sound answer can be obtained;
again, it's just as well these not be implemented; (d) Some
excellent evaluations are ignored because the decision-maker
doesn't like (e.g. is threatened by) the results or
won't take on the risks or trouble of implementation. The
"lack of implementation" phenomenon thus has little or
large implications for the field of evaluation, depending
entirely on the distribution of the causes across these four
categories. It is hardly something to be unduly concerned
about professionally as long as evaluation still has a long
way to go in doing its own job well; doctors shouldn't worry
that their patients ignore their advice if a great deal of it is
bad. But as a citizen one can scarcely avoid worry about the
colossal wastage resulting from the fourth kind of situation;
here's a fairly typical quote from the 8/1/80 GAO reports on
their (usually very good) evaluations: "The Congress has an
excellent opportunity to save billions of dollars by limiting
the number of noncombat aircraft to those that can be adequately
justified ... Dept. of Defense justifications [were]
based on unrealistic data and without adequate consideration
of more economical alternatives." GAO has been
issuing reports on this topic since 1976 without any effect so
far. See Risk Evaluation.

IMPLEMENTATION OF TREATMENT The degree 
to which a treatment has been instantiated in a particular 
situation, typically a field trial of the treatment or an experi-
mental investigation of it. The notion of an "index of im-
plementation," consisting of a set of scales describing the
key features of the treatment, and allowing one to measure
the extent to which it is manifested in each dimension, is a
useful one for checking on implementation, an absolutely
fundamental check if we are to find out whether the treatment
has merit. This is part of the "purely descriptive"
effort in evaluation, and is handled under the description 
checkpoint and the process checkpoint of the Key Evalua- 
tion Checklist. If the description checkpoint provides a 
correct account of the treatment that is supposed to be 
implemented, and the process checkpoint provides a cor- 
rect description of what is actually occurring, the match 
between the two is a measure of the implementation — and 
hence of the extent to which we can generalize from the 
results of the test to an evaluation of the evaluand which we
are supposed to be evaluating. 
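
As a rough illustration of the "index of implementation" idea, the sketch below compares the level each scale is supposed to reach (description checkpoint) with the level actually observed (process checkpoint). The scale names, the numbers, the cap at 1.0 and the unweighted mean are all assumptions made for the example; they are not part of the Key Evaluation Checklist.

```python
from dataclasses import dataclass


@dataclass
class Scale:
    """One key feature of the treatment, rated on a numeric scale."""
    name: str
    specified: float  # level called for in the treatment description
    observed: float   # level actually seen in the field (process check)


def implementation_index(scales):
    """Return per-scale degree of implementation (0..1) and their mean.

    Each ratio is capped at 1.0 so that over-delivery on one feature
    cannot mask under-delivery on another (an assumption of this sketch).
    """
    if not scales:
        raise ValueError("at least one scale is required")
    per_scale = {}
    for s in scales:
        if s.specified <= 0:
            raise ValueError(f"scale {s.name!r} needs a positive specified level")
        per_scale[s.name] = min(s.observed / s.specified, 1.0)
    overall = sum(per_scale.values()) / len(per_scale)
    return per_scale, overall


# Hypothetical field trial: three features of the treatment.
scales = [
    Scale("sessions per week", specified=4, observed=3),
    Scale("minutes per session", specified=50, observed=50),
    Scale("trained instructors (%)", specified=100, observed=50),
]
per_scale, overall = implementation_index(scales)
```

The per-scale figures are usually more informative than the overall mean, since they show which dimension of the treatment failed to materialize.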

IMPROVEMENT, EVALUATION FOR See Forma- 
tive Evaluation. 

INCESTUOUS RELATIONS (in evaluation) Refers to 
(a) extreme conflict of interest (where the evaluator is "in 
bed with" the program being evaluated), as is typical of 
ordinary program monitoring by agencies and foundations 
where the monitor is usually the godfather (sic) of the pro-
gram, sometimes its inventor and nearly always its advo-
cate at the agency, and a co-author of its modifications as
well as (supposedly) its evaluator; (b) incestuous valida-
tion of test items, which occurs when they are selected/rejected on
the basis of the correlation of performance on that item with
overall score on the test. Many widely-used tests have low-
ered their construct validity by dumping face-valid items
because of this. The correct procedure is to check for other
errors (e.g. irrelevance, ambiguity), perhaps by external
judge review or rewriting the item(s), hoping the correla-
tion won't be high, because then you have tapped into an
independent dimension of criterion performance.
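
The item-selection statistic at issue in (b) can be sketched as follows. The response data and the 0.3 threshold are invented for illustration; the point of the sketch is that a low item-total correlation flags an item for review (irrelevance, ambiguity), not for automatic deletion.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_total_correlations(responses):
    """Correlate each item's scores with the overall test score.

    responses: one list of 0/1 item scores per examinee.
    """
    totals = [sum(r) for r in responses]
    n_items = len(responses[0])
    return [pearson([r[i] for r in responses], totals) for i in range(n_items)]


# Hypothetical data: items 0-2 move together, item 3 taps something else.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 1],
]
correlations = item_total_correlations(responses)

# Flag for external review rather than dropping outright.
REVIEW_THRESHOLD = 0.3  # arbitrary cut-off chosen for this example
flagged_for_review = [i for i, r in enumerate(correlations) if r < REVIEW_THRESHOLD]
```

If the flagged item survives judge review as face-valid, its low correlation is evidence that it measures an independent dimension of the criterion, which is a reason to keep it.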

INCREMENTAL NEED An unmet or add-on need. 
Cf. maintenance or met need.

INDEPENDENCE Independence is only a relative no-
tion, but by increasing it, we can decrease certain types of
bias. Thus, the external evaluator is somewhat more inde-
pendent than the internal, the consulting medical specialist
can provide a more "independent opinion" than the family
physician, and so on. But of course both may share certain
biases, and there is always the particular bias that the exter-
nal or "second opinion" is typically hired by the internal
one and is thus dependent upon the latter for this or later
fees, a not inconsiderable source of bias. The more subtle 
social connections between members of the same profes- 
sion, e.g. evaluators, are an ample basis for suspicion about 
the true independence of the second or meta-evaluator's
opinion. The best approach is typically to use more than one
"second opinion" and to sample as widely as possible in
selecting these other evaluators, hoping from an inspection 
of their (independently written) reports to obtain a sense of 
the variation within the field, from which one can extrapo- 
late to an estimate of probable errors. 

INDEPENDENT VARIABLE See Dependent Variable. 

INDICATOR A factor, variable, or observation that is
empirically or definitionally connected with the criterion; a 
correlate. For example, the judgment by students that a 
course has been valuable to them for pre-professional train- 
ing is a (weak) indicator of that value. Criteria, by contrast, 
are, or are definitionally connected with, the "criterion" (real 
pay-off) variable. Indicators thus include but are not limited 
to criteria. Constructed indicators (or "indexes") are vari- 
ables designed to reflect e.g. the health of the economy (a 
social indicator) or the effectiveness of a program. They, like
course grades, are examples of the frequent need for concise 
evaluations even at the cost of some accuracy and reliability. 
Indicators, unlike criteria, have very fragile validity and can 
often be easily manipulated. 

INFERENTIAL STATISTICS That part of statistics concerned with
making inferences from characteristics of samples to char-
acteristics of the population from which the sample comes,
which of course can only be done with a certain degree of 
probability (cf. Descriptive Statistics). Significance tests 
and confidence intervals are devices for indicating the de- 
gree of risk involved in the inference (or "estimate") — but 
they only cover some dimensions of the risk. For example, 
they cannot measure the risk due to the presence of unusual 
and possibly relevant circumstances such as freakish 
weather, an incipient gas shortage, ESP, etc. Judgment thus 
enters into the final determination of the probability of the 
inferred condition. See External Validity for the distinction 
between the inference in inferential statistics and in gener-
alization, or other plausible inference.
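
A confidence interval for a sample mean can be sketched as below. This is a minimal example under stated assumptions: the scores are invented, and the interval uses the normal approximation (for small samples a t-based interval would be somewhat wider). It quantifies sampling risk only, not the other risks mentioned above.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(sample) / sqrt(n)
    centre = mean(sample)
    return centre - half_width, centre + half_width


# Hypothetical test scores from a sample of ten students.
scores = [72, 75, 78, 80, 85, 70, 74, 79, 81, 76]
low, high = mean_confidence_interval(scores)
```

Asking for more confidence (say 0.99) widens the interval; nothing in the calculation reflects freakish weather or any other unusual circumstance, which is where judgment has to enter.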


INITIATION-JUSTIFICATION BIAS (or "Boot Camp is 
Beautiful") The tendency to argue that unpleasant experi- 
ences one went through oneself are good for others (and 
oneself). A key source of bias in the use of alumni interviews
for program evaluation. See Consonance/Dissonance. 

INFORMAL LOGIC Several evaluation theorists con- 
sider evaluation to be in some respects or ways a kind of 
persuasion or argumentation (notably Ernest House, in
Evaluating with Validity, Sage, 1980). In terms of this view, 
it is relevant that there are new movements in logic, law and 
science which give more play to what have previously been 
dismissed as "merely psychological" factors e.g. feelings,
understanding, plausibility, credibility. The "informal logic
movement" parallels that of the New Rhetoric and natural-
istic methodology in the social sciences. Ref. Informal Logic,
ed. Johnson and Blair, Edgepress, 1980.

INFORMED CONSENT The state which one tries to 
achieve in conscious, rational adults as a good start toward 
discharging one's ethical obligations towards human sub-
jects. The tough cases involve semi-rational semi-conscious
semi-adults, and semi-comprehension.

INPUT EVALUATION Usually refers to the undesirable
practice of using quality of ingredients as an index of
quality of output (or of the evaluand) e.g. proportion of
Ph.D.s on a college faculty as an index of merit. It has a
different and legitimate use in the CIPP model. 

INSTITUTIONAL EVALUATION A complex evalua-
tion, typically involving the evaluation of a set of programs
provided by an institution plus an evaluation of the overall
management, publicity, personnel policies and so on of the
institution. The accreditation of schools and colleges is es-
sentially institutional evaluation, though a very poor ex-
ample of it. One of the key problems with institutional
evaluation is whether to evaluate in terms of the mission of
the institution or on some absolute basis. It seems obviously
unfair to evaluate an institution against goals that it isn't
trying to achieve; on the other hand, the mission statements
are usually mostly rhetoric and virtually unusable for gener-
ating criteria of merit, and they are at least potentially sub-
ject to criticism e.g. because of inappropriateness to need of
clientele, internal inconsistencies, impracticality with re-
spect to the available resources, ethical impropriety, etc. So
one must in fact evaluate the goals and the performance 
relative to these goals or do goal-free evaluation. Institu-
tional evaluation always involves more than the sum of the
component evaluations; for example, a major defect in most
universities is departmental dominance, with the attendant
costs in rigidifying career tracks, virtually eliminating the
role-model of the generalist, blocking new disciplines or
programs (and preserving outdated ones) since in
steady-state new ones have to come out of the old depart- 
ments' budget, etc. Most evaluations of schools and col- 
leges fail to consider these system features, which may be 
more important than any components. 



INSTRUMENT Covers not only calipers etc. but also
(especially standardized) paper-and-pencil tests, and a per-
son used to estimate e.g. quality of handwriting. See Cali-
bration, Measurement.

INTERACTIVE (evaluation) One in which the eval-
uees have the opportunity to react to the content of a first
draft of an evaluative report, which is reworked in the light
of any valid criticisms or additions. A desirable approach
whenever feasible, as long as the evaluator has the courage
to make the appropriate criticisms and stick to them despite
hostile and defensive responses, unless they are dis-
proved. Very few have, as one can see by looking at site-
visit or personnel reports that are not confidential, by com-
parison with those that are, e.g. verbal supplements by the
site visitors. See Balance of Power.

INTERACTION Two factors or variables interact if the
effect of one, on the phenomenon being studied, depends
on the magnitude of the other. For example, math educa-
tion interacts with age, being more or less effective on
children depending on their age; and it interacts with math
achievement. There are plenty of interactions between vari-
ables governing human feelings, thought and behavior but
they are extremely difficult to pin down with any precision.
The classic example is the search for aptitude-treatment or
trait-treatment interactions in education; everyone knows
from their own experience that they learn more from certain
teaching styles than from others, and that other people do
not respond favorably to the same styles. Hence there's an
interaction between the teaching style (treatment) and the
learning style (aptitude) with regard to learning. But, de-
spite all our technical armamentarium of tests and measur-
ing instruments, we have virtually no solid results as to the
size or even the circumstances under which these ATIs
occur. (Ref: The Aptitude-Achievement Distinction, ed.
D. R. Green, McGraw Hill, 1974.)
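
In the simplest two-by-two case, an interaction shows up as a "difference of differences": the effect of the treatment is not the same at each level of the aptitude. The sketch below uses invented cell means and invented labels (lecture/discussion, learner types X and Y) purely to illustrate the arithmetic.

```python
def treatment_effect(cell_means, aptitude, baseline, alternative):
    """Gain from switching treatments for learners at one aptitude level."""
    return cell_means[(alternative, aptitude)] - cell_means[(baseline, aptitude)]


def interaction_contrast(cell_means, aptitudes, baseline, alternative):
    """Difference between the treatment effects at two aptitude levels.

    Zero means the treatment effect is the same for both groups
    (no interaction); a non-zero value means the effect of the
    treatment depends on the aptitude.
    """
    first, second = aptitudes
    return (treatment_effect(cell_means, first, baseline, alternative)
            - treatment_effect(cell_means, second, baseline, alternative))


# Hypothetical mean test scores, keyed by (teaching style, learner type).
crossed = {
    ("lecture", "X"): 60, ("discussion", "X"): 70,
    ("lecture", "Y"): 75, ("discussion", "Y"): 65,
}
additive = {
    ("lecture", "X"): 60, ("discussion", "X"): 70,
    ("lecture", "Y"): 75, ("discussion", "Y"): 85,
}

crossed_contrast = interaction_contrast(crossed, ("X", "Y"), "lecture", "discussion")
additive_contrast = interaction_contrast(additive, ("X", "Y"), "lecture", "discussion")
```

In the first table discussion helps type X and hurts type Y, so the contrast is large; in the second it helps both groups equally and the contrast is zero. Real ATI studies face the harder problem the entry describes: estimating such contrasts reliably from noisy data.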

INTERNAL Internal evaluators (or evaluations) are 
(done by) project staff, even if they are special evaluation 
staff, i.e., even if they are external to the production/writing/
teaching/service part of the project. Usually, internal evalu-
ation is part of the formative evaluation effort, but long-term
projects have often had special summative evaluators on
their staff, despite the low credibility (and probably low
validity) that results. Internal/external is really a difference
of degree rather than kind; see Independence.

INTERNAL VALIDITY The kind of validity of an eval- 
uation or experimental design that answers the question: 
"Does the design prove what it's supposed to prove about 
the treatment on the subjects actually studied?" (cf. External
Validity). In particular, does it prove that the treatment
produced the effect in the experimental subjects? Relates to
the EFFECTS checkpoint in the Key Evaluation Checklist.
Common threats to internal validity include poor in-
struments, participant maturation, spontaneous change, or
assignment bias. (Ref. Experimental and Quasi-Experi-
mental Designs for Research, D. T. Campbell and J. C. Stan-
ley, Rand McNally & Co., Chicago, 1972.)

INTEROCULAR DIFFERENCES Fred Mosteller, the
great practical statistician, is fond of saying that he's not
interested in statistically significant differences, but only in
interocular ones: those that hit you between the eyes. (Or
that's what people are fond of saying he's fond of saying.) 

INTERPOLATE Infer to conclusions about values of the 
variables within the range sampled. Cf. Extrapolate. 

INTERRUPTED TIME SERIES A type of quasi- 
experimental design in which the treatment is applied and 
then withheld in a certain temporal pattern, to the same 
subjects. The somewhat ambiguous term "self controlled"
used to be used for such cases, since the control group is the
same as the experimental group. The simplest version is of 
course the "aspirin for a headache" design; if the headache 
goes away, we credit the aspirin. On the other hand, "psy- 
chotherapy for a neurosis" provides a weak inference 
because the length of the treatment is so great and spontan-
eous recovery rates are so high that the chance of the
neurosis ending during that interval for other reasons than 
the psychotherapy is very significant. (Hence short-term 
psychotherapy is a better bet, ceteris paribus.) The next
fancier self-controlled design is the so-called "ABBA" de- 
sign, where A is the treatment, B the absence of it— or 
another treatment. Measurements are made at the begin- 
ning of each labeled period and at the end. Here we may be 
able to control for the spontaneous remission possibility 
and sundry interaction effects. This is quite a good design 
for experiments on supportive or incremental treatments, 
e.g. we teach 50 words of vocabulary by method A, then 50
more by method B— and to eliminate the possibility that B 
only works when it follows A, we now reverse the order, 
and apply it first, and then A. Obviously more sophisticated 
approaches are possible by using curve-fitting to extrapo- 
late (or interpolate) to an expectable future (or past) level 
and compare that with the actual level. The classic fallacy in 
this area is probably that of the Governor of Connecticut 
who introduced automatic license suspension for the first 
speeding violation and got a very large reduction in the 
highway fatality rate immediately, about which he crowed a
good deal. But a look at the variability of the fatality rate in
previous years would have made a statistician nervous, and
sure enough, it soon swung up again in its fairly random
way. (Ref. Interrupted Time Series Designs, Glass, et al.,
University of Colorado.)
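
The "look at the variability in previous years" check can be sketched very simply: compare the change observed after the intervention with the spread of ordinary year-to-year changes before it. The figures below are invented for illustration and are not the actual Connecticut data; the ratio is a crude yardstick, not a formal significance test.

```python
from statistics import stdev


def change_vs_history(pre_series, post_value):
    """Compare a post-intervention change with typical past changes.

    Returns the observed change, the standard deviation of earlier
    year-to-year changes, and the ratio of the two.
    """
    if len(pre_series) < 3:
        raise ValueError("need at least three pre-intervention points")
    past_changes = [later - earlier
                    for earlier, later in zip(pre_series, pre_series[1:])]
    observed = post_value - pre_series[-1]
    spread = stdev(past_changes)
    return observed, spread, observed / spread


# Hypothetical annual fatality counts before the crackdown, then the
# first year after it.
before = [290, 250, 310, 265, 325]
after_first_year = 285

observed, spread, ratio = change_vs_history(before, after_first_year)
```

Here the drop of 40 looks impressive in isolation, but it is smaller than the typical swing between earlier years, so it gives little ground for crediting the treatment. Curve-fitting approaches refine the same comparison by projecting an expected level and comparing it with the actual one.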

JOB ANALYSIS A breakdown of a job into functional 
components, often necessary in order to provide remedial 
recommendations and a framework for micro-evaluation or 
needs assessment. Job analysis is a highly skilled task,
which, like computer programming, is usually done badly
by those hired to do it because of the failure of the pay scale
to reflect the pay-offs from doing it well.

JOHN HENRY EFFECT (Gary Saretsky's term) The
correlative effect to, or in an extended sense a special case
of, the Hawthorne effect, i.e., the tendency of the control 
group to behave differently just because of the realization 
that they are the control group. For example, a control group
of teachers using the traditional math program that is being 
run against an experimental program may — upon realizing 
that the honor of defending tradition lies upon them— 
perform much better during the period of the investigation
than they would have otherwise, thus yielding an artificial
result. One cannot of course assume that the Hawthorne
effect (on the experimental group) cancels out the John
Henry effect.

JUDGMENT It is not accidental that the term "value 
judgment" erroneously came to be thought of as the para- 
digm of evaluative claims; judgment is a very common part 
of evaluation, as it is of all serious scientific inference. (The 
absurdity of supposing that "value judgments" could have 
no validity, unlike all other judgments, was an additional 
and gratuitous error.) The function of the discipline of eval-
uation can be seen as largely a matter of reducing the ele-
ment of judgment in evaluation, or reducing the element of
arbitrariness in the necessary judgments e.g. by reducing
the sources of bias in the judges by using double-blind 
designs, teams, parallel teams, convergence sessions, cali- 
bration training etc. The most important fact about judg- 
ment is not that it isn't as objective as measurement (true) 
but that one can distinguish good judgment from bad judg- 
ment (and train good judges.) 

JUDICIAL OR JURISPRUDENTIAL MODEL (of evalu-
ation) Wolff's preferred term and a term sometimes used
for his version or, rather, extension of advocate-adversary
evaluation. He emphasizes that the law as a metaphor for
evaluation involves much more than an adversarial de-
bate: it also includes the fact-finding phase, cross-
examination, evidentiary and procedural rules, etc. It in-
volves a kind of inquiry process that is markedly different
from the social scientific one, one that in several ways is
tailored to needs more like those of evaluation (the action-
related decision, the obligatory simplifications because of
time, budget and audience limitations, the dependence on a
particular judge and jury, the fate of individuals at stake,
etc.). Wolff sees the educational role of the judicial process
(teaching the jury the rules of just inquiry) as a key feature of 
the judicial model and it is certainly a strong analogy with 
evaluation. 

JURY TRIAL Used in TA and evaluation. See preced- 
ing entry. 

KEY EVALUATION CHECKLIST (KEC) What fol- 
lows is not intended to be a full explanation of the key 
evaluation checklist and its application, something which 
would be more appropriate for a text on evaluation. It sim- 
ply serves to identify the many dimensions that must be 
explored prior to the final synthesis in an evaluation. All are 
usually very important. A few words are given to indicate 
the sense in which each of the headings is intended, the 
headings themselves being kept very short in order to make 
them usable as mnemonics; some are expanded elsewhere 
in the Thesaurus. Many iterations of the KEC are involved 
in a typical evaluation, which is a process of successive 
approximation. (If the evaluation is to be goal-free, at least 
the field personnel will not follow the given sequence.) 

The purpose of exhibiting the KEC here is partly to make
the point that evaluation is an extremely complicated disci-
pline, what one might call a multi-discipline. It cannot be
seen as a straightforward application of standard methods
in the traditional social science repertoire. In fact only seven
of the fifteen checkpoints are seriously addressed in that 
traditional repertoire, and in most cases not very well ad- 
dressed as far as evaluation needs are concerned. 

1. DESCRIPTION. What is to be evaluated? The evalu-
and, described as neutrally as possible. Does it have com-
ponents? What are their relationships? It's useful to di-
vide description into four parts: (1.1) the nature and op-
eration, (1.2) the function, (1.3) the delivery system and
(1.4) the support system. These are not sharply distinct,
nor are any of them sharply distinct from effects and
process. What does it do? What is its function? How does
it do it, e.g. what is the delivery system that connects it with
consumers? How does it continue to do it; what is the
support system? This includes the maintenance/service/
update system, the instruction/training system for users,
the monitoring system (if any) for checking on proper
use/maintenance etc. (Monitoring, like maintenance etc.
may be done by someone wholly other than the vendor/
manufacturer/service provider.) When the evaluation
process begins, the evaluator usually has only the client's
descriptions of the evaluand and its functions etc. As the
evaluation proceeds, descriptions from staff, users, mon-
itors, etc. will be gathered, and direct observation/tests/
measurements will be made. Eventually only the most
accurate description will survive under this checkpoint,
but the others are most important and will be retained
(under checkpoint 3, Background and Context) because
they provide important cues as to what problems (e.g.
misperceptions, inconsistent perceptions) should be ad-
dressed in the evaluation and in the final report.

2. CLIENT. Who is commissioning the evaluation? The 
client for the evaluation, who may or may not be the 
initiator of the request for the evaluation; and may or may 
not be the instigator of the evaluand, e.g. its manufacturer 
or funding agency or legislative godparent; and may or
may not be its inventor e.g. designer of a product or
program. The client's wishes are crucial but not com-
pletely paramount in determining the focus of the evalua-
tion. As with any professional activity, the obligations of
professional ethics occasionally supervene e.g. one must
always check (at least briefly) for bad side-effects in pro-
gram evaluation even if the client thinks that a report on
goal achievement is all that evaluation requires. 

3. BACKGROUND & CONTEXT of (a) the evaluand 
and (b) the evaluation. Includes identification of stake-
holders (such as the non-clients listed in 2, the monitor,
community representatives, etc.); intended function and
supposed nature of the evaluand; believed performance;
expectations from the evaluation; desired type of evalua-
tion (formative vs summative vs ritualistic, holistic vs
analytical); reporting system, organization charts, history
of project, prior evaluation efforts, etc. 

4. RESOURCES (Sometimes called the "Strengths As- 
sessment" by contrast with the needs assessment of 
checkpoint 6) (a) available to or for use of the evaluand; 
(b) available to or for use of the evaluators. These are not
what is used up, in e.g. purchase or maintenance, but
what could be. They include money, expertise, past ex-
perience, technology, and flexibility considerations.
These define the range of feasibility and hence delimit the 
investigation and criticism. 

5. CONSUMER. Who is using or receiving the effects of 
the evaluand? It is often useful to distinguish the group 
that needs the evaluand from those who get it or to whom 
it could be delivered (the market). It may even be useful to 
distinguish targeted populations of consumers (the intended
market) from actually and potentially directly impacted
populations of consumers: the "true market," or custom-
ers, or recipients, or clients for the evaluand (often called
the clientele). These should be distinguished from the total 
directly or indirectly impacted recipient population 
which makes up the "true consumers." Note that the in- 
stigator and others (see 2 and 3) are also impacted, e.g. by 
having a job, but this does not make them consumers, in 
the usual sense. We should, however, consider them 
when looking at total effects and can describe them as 
part of the total affected, impacted or involved group:
the provider population. (Taxpayers are usually part of 
this population.) Recipients + providers = impactees. 

6. VALUES. Sometimes called the "Needs Assess- 
ment" of the impacted and potentially impacted popula- 
tions. But it must look at wants as well as needs; and also 
values such as judged or believed standards of merit and 
ideals; the defined goals of the program where a goal- 
based evaluation is undertaken; any validated standards 
that apply to the field; and the needs etc. of the instigator, 
monitor, inventor etc., since they are indirectly impacted. 
The relative importance of these often conflicting consid- 
erations will depend upon ethical and functional consid- 
erations. It is from this checkpoint alone that one gets the 
value component in the evaluation; the values may apply 
to either process or outcome. 

7. PROCESS. What constraints and values apply to, 
and what conclusions can we draw about, the normal 
operation of the evaluand (as opposed to its effects or
OUTCOMES (8))? In particular, legal / ethical-moral /
political / managerial / aesthetic / hedonic / scientific
constraints? With this checkpoint we begin to draw eval-
uative conclusions. The ones here are the most immediate
consequences of all, because they involve what are some-
times called intrinsic values (truth, beauty, ethics). We
call it the process checkpoint partly for reasons of con-
tinuity with tradition, because in program/service eval-
uation, it's the process that we look at for intrinsic evalua-
tion. But in product evaluation the same checkpoint
applies; for example, we would look at the scientific
accuracy of the contents of a textbook under this heading.
Given a basic description of the process from the first
checkpoint and the values from checkpoint 6, we can
sometimes draw an immediate evaluative conclusion e.g.
"violates safety standard X." More commonly we have to
do some further investigation of the process in order to
see if the relevant standards are upheld. There are four
other reasons for looking at process: to see if what's
happening is what's supposed to be happening (the de-
gree of implementation issue), to get clues about causa-
tion that will not appear from a black box approach (but
which we may need for determining long-term out-
comes), to spot immediate or very fast outcomes, and to
look for indicators that are known to be correlated with
certain long-term outcomes (whose emergence we may
not have time to await). The first of these eventually
results in corrections to DESCRIPTION; the others feed
into OUTCOMES. Thus, some "process phenomena" are
effects of the program e.g. enjoyment (instant outcomes),
some are part of it, and some are part of the context /
environment / system containing it. One managerial proc-
ess constraint of special significance concerns the "degree
of implementation," i.e., the extent to which the actual
operation matches the program stipulations or sponsor's
beliefs about its operation. One scientific process consid-
eration would be the use of scientifically validated proc-
ess indicators of eventual outcomes; another would be
the use of scientifically (historically etc.) sound material in
a textbook/course. One ethical issue would involve the
relative weighting of the importance of meeting the needs
of needy target population people and the career or status
needs of other impacted-population people e.g. the pro-
gram staff.

8. OUTCOMES. What effects (long-term outcomes or
concurrent effects) are produced by the evaluand? (In-
tended or unintended.) A matrix of effects is useful to get
one started on the search: population affected x type of
effect (cognitive/affective/psychomotor/health/social/
environmental) x size of each x time of onset (immedi-
ate/end of "treatment"/later) x duration x each compo-
nent or dimension (if analytical evaluation is required).
For some purposes, the intended effects should be separ-
ated from the unintended (e.g. program monitoring,
legal accountability); for others, the distinction should
not be made (consumer-oriented summative product
evaluation).

9. GENERALIZABILITY (or potential or versatility) to
other people/places/times/versions. ("People" means
staff as well as recipients.) These can be connected with
Deliverability/Saleability/Exportability/Durability/
Modifiability.

10. COSTS. Dollar vs Psychological vs Personnel vs
Time; Initial vs Recurrent (including Preparation-Main-
tenance-Improvement); Direct/Indirect vs Immediate/
Delayed/Discounted; by components if appropriate.
11. COMPARISONS with alternative options; include
options recognized and unrecognized, those now avail-
able and those constructable. The leading contenders in
this field are the "critical competitors" and are identified
on cost plus effectiveness plus everything else grounds.
They normally include those that produce similar or bet-
ter effects for less cost and better effects for a manageable
(RESOURCES) extra cost.

12. SIGNIFICANCE. A synthesis of all the above. The
validation of the synthesizing procedure is often one of
the most difficult tasks in evaluation. It cannot normally
be left to the client who is usually ill-equipped by experi-
ence or objectivity to do it, and the formula approaches of
e.g. cost-benefit calculations are only rarely adequate.
"Flexible weighted sum with overrides" is often useful.
See Weight and Sum.

13. RECOMMENDATIONS. These may or may not be
requested, and may or may not follow from the evalua-
tion; even if requested it may not be feasible to provide
any, because the only type that would be appropriate are
not such that any scientific evidence for specific ones is
available in the relevant field of research. (RESOURCES
available for the evaluation are crucial here.)

14. REPORT. Vocabulary, length, format, medium,
time, location, and personnel for the presentation(s) need
careful scrutiny as does protection/privacy/publicity and
prior screening or circulation of final and preliminary
drafts.

15. METAEVALUATION. The evaluation must be evalu-
ated, preferably prior to final dissemination of report, and
certainly prior to implementation. External evaluation is
desirable, but first the primary evaluator should apply the
Key Evaluation Checklist to the evaluation itself. Results
of the metaevaluation should be used formatively but
may also be incorporated in the report or otherwise con-
veyed (summatively) to the client and other appropriate
audiences. ("Audiences" emerge at metacheckpoint 5,
since they are the "Market" and "Consumers" of the
evaluation.)
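
The "matrix of effects" suggested under checkpoint 8 can be laid out as an empty grid to be filled in during the search for outcomes. The sketch below is only one possible arrangement: the effect types and onset times come from the checkpoint text, while the population names, the default component label and the field names in each cell are assumptions made for the example.

```python
from itertools import product

EFFECT_TYPES = ["cognitive", "affective", "psychomotor",
                "health", "social", "environmental"]
ONSETS = ["immediate", "end of treatment", "later"]


def blank_effects_matrix(populations, components=("whole evaluand",)):
    """Build one empty cell per population x effect type x onset x component.

    Size, duration and intended/unintended status are recorded inside
    each cell as evidence comes in.
    """
    return {
        cell: {"size": None, "duration": None, "intended": None}
        for cell in product(populations, EFFECT_TYPES, ONSETS, components)
    }


# Hypothetical program with two impacted populations.
matrix = blank_effects_matrix(["recipients", "staff"])

# Recording one finding.
key = ("recipients", "cognitive", "immediate", "whole evaluand")
matrix[key].update(size="moderate gain", duration="unknown", intended=True)

# Cells still to be investigated.
unexamined = [cell for cell, entry in matrix.items() if entry["size"] is None]
```

Listing the unexamined cells is a cheap reminder of where side-effects may still be hiding; for an analytical evaluation the `components` argument would name each component or dimension separately.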

KILL THE MESSENGER (phenomenon) The ten- 
dency to punish the bearer of bad tidings. One aspect of 
valuephobia. Much of the current attack on testing e.g. 
minimum competency testing is pure KTM, like many of the
elaborately rationalized earlier attacks on course grades.
The presence of the rationalizations identifies these as ex-
amples of a sub-species: Kill the Messenger (After a Fair
Trial, Of Course).

LAISSEZ-FAIRE (evaluation) "Let the facts speak for
themselves." But do they? What do they say? Do they say
the same thing to different listeners? Once in a while this
approach is justified, but usually it's simply a cop-out, a
refusal to do the hard professional task of synthesis and its
justification. The laissez-faire approach is attractive to
valuephobes, and to anyone else when the results are
going to be controversial. The major risk in the naturalistic/
responsive/illuminative approach is sliding into laissez-
faire evaluation, i.e. (to put it slightly tendentiously) no
evaluation at all.

LEARNER VERIFICATION A phrase of Ken Komos- 
ki's, president of EPIE; refers to the process of (a) establish- 
ing that educational products actually work with the in- 
tended audience, and (b) systematically improving them in 
the light of the results of field tests. Now required by law in
e.g. Florida and being considered for that status elsewhere.
The first response of publishers was to submit letters from
teachers testifying that the materials worked. This is not the 
R&D process that the term refers to. Some of the early 
programmed texts were good examples of learner verifica- 
tion. Of course, it's costly, but so are four-color plates and 
glossy paper. It simply represents the application to educa- 
tional products of the procedures of quality control and 
development without which other consumer goods are il-
legal or dysfunctional or suboptimal.

LEVEL OF EFFORT Level of effort is normally specified 
in terms of person-years of work, but on a small project 
might be specified in terms of person-months. It refers to 
the amount of direct "labor" that will be required, and it is
presumed that the labor will be of the appropriate profes- 
sional level; subsidiary help such as clerical and janitorial is 
either budgeted independently or regarded as part of the 
support cost, that is, included in a professional person-year 
of work. Person-years (originally man-years) is the normal 
unit for specifying level of effort. RFPs will often not de-
scribe the maximum sum in dollars that is countenanced for
the proposal, but may instead specify it in terms of person-
years. Various translations of a person-year unit into dollars
are used; this will depend on the agency, the level of profes-
sionalism required, whether or not overhead and clerical
support is separately specified, etc. Figures from $30,000 to 
over $50,000 per person-year are used at times. 
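
The translation from level of effort into dollars can be sketched as simple arithmetic. The following is only an illustration: the function names are hypothetical, the default rates are the two figures quoted above, and a real rate would depend on the agency and on what overhead is included.

```python
def person_months_to_years(person_months):
    """Convert person-months to person-years (12 months per year)."""
    return person_months / 12


def effort_cost_range(person_years, low_rate=30_000, high_rate=50_000):
    """Return (low, high) dollar estimates for a level of effort.

    The default rates are the per-person-year figures quoted in the entry;
    they are assumptions for illustration, not fixed prices.
    """
    if person_years < 0:
        raise ValueError("person_years must be non-negative")
    return person_years * low_rate, person_years * high_rate


# Hypothetical small project: 18 person-months of professional labor.
years = person_months_to_years(18)
low, high = effort_cost_range(years)
print(f"{years} person-years: ${low:,.0f} to ${high:,.0f}")
```

Reading an RFP that specifies only person-years, a proposer could run the same calculation in reverse to infer the dollar ceiling the agency probably has in mind.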

LICENSING (of evaluators) See Evaluation Registry. 

LITERARY CRITICISM The evaluation of works of
literature; in many ways an illuminating model for
evaluation, and a good corrective for the emphases of the social
science model. Various attempts have been made to
"tighten up" literary criticism, of which the New Criticism
movement is perhaps the best known, but they all involve
rather blatant and unjustified preferences of their own (i.e.
biases), exactly what they were alleged to avoid. The time
may be ripe to try again, using what we now know about 
sensory evaluation, and perhaps responsive and illuminative
evaluation, to remind us of how to objectify the
objectifiable while clarifying the essentially subjective.
Conversely, a good deal can be learnt from a study of the efforts
of F.R. Leavis (the doyen of the New Critics) and T.S. Eliot
in his critical essays to precisify and objectify criticism. His
view that "comparison and analysis are the chief tools of the
critic" (Eliot 1932), and even more his practice of displaying
very specific and carefully chosen passages to make points,
would find favor with the responsive evaluators (and
others) today. Ezra Pound and Leavis went even further
towards exhibiting the concrete instance (rather than the
general principle) to make a point. This idiographic,
anti-nomothetic approach is not, contrary to much popular
philosophy of science, anti-scientific as such; but in practice it
failed to avoid various style or process biases, and too often
(e.g. with Empson) became precious at the expense of logic.
One can no more forget the logic of plot or the limits of
possibility in fiction than the logic of function and the limits
of responsibility in program evaluation.

LOCAL EXPERT A local expert (used as an evaluator) 
is someone from the same field as the program or person
being evaluated, e.g. a "health area evaluator," or (more
specifically) a nursing program evaluator, or (even more
commonly) "someone else from Texas in nursing
education" (but without evaluation expertise). The gains are in
relevant specific knowledge; the losses are in shared bias
and (usually) lack of knowledge of or experience with the
more serious aspects of e.g. program evaluation as a
discipline. If you're looking for a friendly evaluator, use a local
one, and maybe they'll return the favor one day, too. If
you want objectivity/validity, always go for the mix (one
local, one general-purpose) and ask for separate reports. If
your budget is too small for the travel costs or fees of a
national-level evaluator, just find a geographically local
general-purpose evaluator or at the worst just an evaluator
from another discipline (there are plenty around) to form
a team with your "local" favorite.

LOCUS OF CONTROL Popular "affective" variable,
referring roughly to the location someone feels is
appropriate for the center of power in the universe on a scale from
"inside me" to "far, far away." A typical item might ask
about the extent to which the subject feels s/he controls
their own destiny. In fact, this is often a simple test of
knowledge about reality and not affective (depending on
how much stress is put on the feeling part of the item), and
where it is affective, the affect may be judged as appropriate
or inappropriate. So these items are usually misinterpreted,
e.g. by taking any movement towards internalization of
locus of control as a gain, whereas it may be a sign of loss of 
contact with reality. 

LONGITUDINAL STUDY An investigation in which 
a particular individual or group of individuals is followed
over a substantial period of time, in order to discover
changes due to the influence of an evaluand, or maturation,
or environment. The contrast is with a cross-sectional
study. Theoretically, a longitudinal study could also be an
experimental study, but none of those done on the effect of
smoking on lung cancer are of this kind although the results
are almost as solid. In the human services area, it is very
likely that longitudinal studies will be uncontrolled, cer- 
tainly not experimentally controlled. 

LONG-TERM EFFECTS In many cases, it is important
to examine the effects of the program or product after an
extended period of time; often this is the most worthwhile
criterion. Bureaucratic arrangements such as the difficulty
of carrying funds over from one fiscal year to the next often
make investigation of these effects virtually impossible.
"Longitudinal studies" where one group is "followed up"
over a long period are more commonly recognized as
standard procedure in the medical and drug areas; an important
example in education is the PROJECT TALENT study, now 
in its third decade. See Overlearning. 

MACRO-EVALUATION. See Holistic Evaluation. 

MAINTENANCE NEED A met but continuing need.

MAN-YEARS (properly, person-years) See Level of
Effort.

MARKET The market consideration in the Key Evalua- 
tion Checklist refers to the disseminability of the product or 
program. Many needed products, especially educational 
ones, are unsaleable by available means. It is only possible
to argue for developing such products if there is a special,
preferably tested plan for getting them used. No delivery 
system, no market. No market, no needs met. (It does not
follow that the existence of a market implies needs met or
any other basis for worth.)

MASSAGING (the data) Irreverent term for (mostly)
legitimate synthesis of the raw results. 

MASTERY LEVEL The level of performance actually 
needed on a criterion. Focus on mastery level training does
not accept anything less, and does not care about anything
more. In fact, the "mastery level" is often arbitrary. Closely
tied to competency-based approaches. Represents one
application of criterion-referenced testing.

MATCHING See Control Group.

MATERIALS (evaluation) See Product Evaluation. 



MATRIX SAMPLING If you want to evaluate a new 
approach to preventive health care (or science education),
you do not have to give a complete spectrum of tests
(perhaps a total of ten) to all those allegedly affected, or even to a
sample of them; you can perfectly well give one or two tests
to each (or each in the sample), taking care that each test
does get given to a random sub-sample, and preferably that
it is randomly associated with each of the others, if they are
administered pairwise (in order to reduce any bias due to
interactions between tests). This can yield (a) much less cost
to you than full testing of the whole sample, (b) less strain
on each subject, (c) some contact with each subject by
contrast with giving all tests to a smaller sample, (d) ensuring
that all of a larger pool of items are used on some
students. But (the trade-off) you will not be able to say
much about each individual. You are only evaluating the
treatment's overall value. A good example of the importance
of getting the evaluation question clear before doing a
design.
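
The pairwise assignment described above can be sketched in a few lines. This is a minimal illustration under assumed numbers (90 subjects, ten tests, two tests each); the function name `matrix_sample` and the subject and test labels are hypothetical, and rotating through shuffled pairs is just one simple way to keep the design balanced.

```python
import random
from itertools import combinations


def matrix_sample(subjects, tests, tests_per_subject=2, seed=None):
    """Assign each subject a small random subset of the tests.

    Every possible subset (pair, by default) is used about equally often,
    so each test reaches a random sub-sample of similar size and is paired
    with each other test about equally often.
    """
    rng = random.Random(seed)
    subsets = list(combinations(tests, tests_per_subject))
    shuffled = list(subjects)
    rng.shuffle(shuffled)   # random order of subjects
    rng.shuffle(subsets)    # random order of test subsets
    # Deal the subsets out in rotation so usage stays balanced.
    return {s: subsets[i % len(subsets)] for i, s in enumerate(shuffled)}


subjects = [f"subject_{i}" for i in range(90)]   # hypothetical sample
tests = [f"test_{j}" for j in range(10)]         # the "total of ten" tests
plan = matrix_sample(subjects, tests, seed=1)

# How many subjects take each test.
load = {t: sum(t in assigned for assigned in plan.values()) for t in tests}
print(load)
```

With these numbers there are 45 possible pairs, so each pair is used twice and each test is taken by 18 subjects: 180 test administrations in all, against 900 for giving every test to everyone. The cost is the one noted in the entry: no subject has a complete profile.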

MBO Management By Objectives, i.e. state what you're
trying to do in language that will make it possible to tell
whether you succeeded. Not bad as a guide to planning
(though it tends to overrigidify the institution), but
disastrous as a model for evaluation (though acceptable as
one element in an evaluation design). See Goal-Based
Evaluation. 






MEAN (Stat.) (Cf. Median, Mode) The mean score on
a test is that obtained by adding all the scores and dividing
by the number of people taking it; one of the several exact
senses of "average." The mean is, however, heavily
affected by the scores of the top and bottom few in the class,
and can thus be non-representative of the majority.

MEASUREMENT Determination of the magnitude of a 
quantity; not necessarily, though typically, on a
criterion-referenced test scale, e.g. feeler gauges, or on a continuous
numerical scale. There are various types of measurement
scale, in the loose sense, ranging from ordinal (grading or
ranking) to cardinal (numerical scoring). The standard
scientific use refers to the latter only. Whatever is used to do
the measurement, apart (usually) from the experimenter,
is called the instrument. It may be a questionnaire
or a test or an eye or a piece of apparatus. In certain contexts,
we treat the observer as the instrument needing calibration
or validation. Measurement is a common and sometimes
large component of standardized evaluations, but a very 
small part of its logic, i.e. of the justification for the evalua- 
tive conclusions. 

MEDIAN (Stat.) (Cf. Mean, Mode) The median
performance on a test is that score which divides the group into
two, as nearly as possible; the "middle" performance. It
provides one exact sense for the ambiguous term
"average." The median is not affected at all by the performance of
the few students at the top and bottom of a class (cf. Mean).
On the other hand, as with the mean, it may be the case that
no one scores at or near the median, so that it doesn't
identify a "most representative individual" in the way that
the mode does. Scoring at the 50th percentile is (usually)
the same as having the median score, since 50% are below
you and 50% above.
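
The contrast between the two senses of "average" is easy to check numerically. The scores below are invented for illustration; the sketch simply moves the bottom and top students further out and compares what happens to each statistic.

```python
from statistics import mean, median

# Hypothetical class of nine scores; the values are invented for illustration.
scores = [55, 60, 62, 65, 68, 70, 72, 75, 78]

# Same class, but the bottom student scores 10 and the top student 100.
with_extremes = [10] + scores[1:-1] + [100]

print(mean(scores), median(scores))
print(mean(with_extremes), median(with_extremes))
```

The median is the same in both cases (the middle student has not moved), while the mean shifts with the two extreme scores.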

MEDIATED EVALUATION A more precise term for
what is sometimes called (in a loose sense) process
evaluation, meaning evaluation of something by looking at
secondary indicators of merit, e.g. name of manufacturer,
proportion of Ph.D.s on faculty, where someone went to
college. The term "process evaluation" also refers to the direct
check on e.g. ethicality of process. See Key Evaluation
Checklist.

MEDIATION (OR ARBITRATION) model of
evaluation. Little attention has been paid to the interesting social
role and skills of the mediator or arbitrator, which in several
ways provide a model for the evaluator, e.g. the combination
of distancing with considerable dependence upon reaching
agreement, the role of logic and persuasion, of ingenuity
and empathy.

MEDICAL MODEL (of evaluation) In Sam Messick's 
version (in the Encyclopedia of Educational Evaluation) the 
contrast is drawn between the engineering model and the 
medical model. The engineering model "focuses upon
input-output differences, frequently in relation to cost."
The medical model, on the other hand (which Messick
favors), provides a considerably more complex analysis,
enough to justify: the treatment's generalization into other
field settings; remediation suggestions; and side effect
predictions. The problem here is that we cross the boundaries
between evaluation and general causal investigations, 
thereby diluting the distinctive features of evaluation and so 
expanding its scope as to make results extremely difficult to 
obtain. It seems more sensible to appreciate Consumer
Reports for what it gives us, rather than complain that it fails to
give us explanations of the underlying mechanisms in the
products and services that it rates. Cf. Holistic and Analytic
Evaluation.

MERIT (Cf. Worth, Success) "Intrinsic" value as
opposed to extrinsic or system-based value/worth. For
example, the merit of researchers lies in their skill and
originality; their worth (to the institution that employs them)
would include the income they generate. 

META-ANALYSIS (Gene Glass) The name for a
particular approach to synthesizing studies on a common topic,
involving the calculation of a special parameter for each
("Effect Size"). Its promise is to pick up something of value
even from studies which do not meet the usual "minimum
standards"; its danger is what is referred to in the computer
programming field as the GIGO Principle: Garbage In,
Garbage Out. While it is clear that a number of studies,
none of which is statistically significant, can be integrated by






a meta-analyst into a highly significant result (because the 
combined N is larger), it is not clear how invalid designs can 
be integrated. An excellent review of results and methods
will be found in Evaluation in Education Volume 4, No. 1,
1980, a special issue entitled "Research Integration: the
State of the Art". Meta-analysis is a special approach to
what is called the general problem of research (studies)
integration or research synthesis, and this array of terms for
it reflects the fact that it is an intellectual activity that lies
between data synthesis on the one hand and the evaluation
of research on the other. As Light points out (ibid.) there is a
residual element of judgment involved at several places in
meta-analysis as in any research synthesis process;
clarifying the basis for these judgments is a task for the evaluation
methodologist, and Glass' efforts to do so have led to the
burgeoning of a very fruitful area of (meta-)research. It's an
example of self-referent research in the same way as
meta-evaluation.
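
The point that individually non-significant studies can combine into a significant result can be illustrated with a small calculation. This sketch assumes a fixed-effect, inverse-variance pooling of standardized mean differences and a common large-sample approximation for their variance; that is one standard way of combining effect sizes, not necessarily the procedure Glass used, and the eight studies are invented.

```python
from math import sqrt
from statistics import NormalDist


def d_variance(d, n1, n2):
    """Approximate sampling variance of a standardized mean difference."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def two_sided_p(estimate, variance):
    """Two-sided p-value from a normal (z) test of the estimate against zero."""
    z = abs(estimate) / sqrt(variance)
    return 2 * (1 - NormalDist().cdf(z))


def pooled_effect(effects, variances):
    """Fixed-effect (inverse-variance weighted) pooled estimate and variance."""
    weights = [1 / v for v in variances]
    estimate = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return estimate, 1 / sum(weights)


# Eight hypothetical studies, each with effect size 0.3 and 30 subjects per group.
effects = [0.3] * 8
variances = [d_variance(d, 30, 30) for d in effects]

single_p = [two_sided_p(d, v) for d, v in zip(effects, variances)]
pooled_d, pooled_var = pooled_effect(effects, variances)
pooled_p = two_sided_p(pooled_d, pooled_var)
print(max(single_p), pooled_p)
```

No single study here reaches the conventional 0.05 level, but the pooled estimate does so comfortably because its variance shrinks with the combined N. Nothing in the arithmetic repairs an invalid design, which is the GIGO worry raised above.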

META-EVALUATION Meta-evaluation is the
evaluation of evaluations, and hence typically involves using
another evaluator to evaluate a proposed or completed
evaluation. This practice puts the primary evaluator in a
similar position to the evaluee; both are going to be evaluated on
their performance. It can be done formatively or
summatively. Reports should go to the original client, with a copy to the
first-level evaluator for reaction. Meta-evaluation then
gives the client independent evidence about the technical
competence of the primary evaluator. No infinite regress is
generated because extrapolation shows it doesn't pay after
the first meta-level on most projects and the second on any.
Meta-evaluation is a professional obligation for evaluators,
as psychoanalysis is for psychoanalysts. A dimensional ap- 
proach might consider the validity, the credibility, the util- 
ity (timeliness, readability, relevance), robustness and cost. 
The Key Evaluation Checklist can also be applied, in two 
ways: either by using it to generate a correct evaluation (or 
design), which can then be compared to the actual one 
(secondary evaluation), or by applying the checklist to the 
original evaluation as a product (true metaevaluation). The
latter process includes the former as an appropriate
scientific process consideration, but it also requires us to look at
e.g. the cost-effectiveness of the evaluation itself, and hence
does something to assist the balance of power. It should, for
example, normally include a look at the differential costs of
Type 1 and Type 2 errors in the evaluation. Evaluations
must not, however, be evaluated in terms of their actual
consequences, but only in terms of their consequences if
used appropriately. Besides the KEC, one might use the
various Evaluation Standards or Bob Gowin's QUEMAC
approach. Professionalism in evaluation requires regulation
of the subject's self-reference and hence of the obligation to
true meta-evaluation. See also Consonance. (Ref. Dan
Stufflebeam, "Meta-Evaluation: Concepts, Standards and Uses"
in Educational Evaluation Methodology: the State of the
Art, ed. Ronald Berk, Johns Hopkins, 1981.)

MICRO-ANALYSIS This can either refer to: an evalua- 
tion which includes evaluations of the components of the 
evaluand; or to evaluation broken out by dimensions (see 
Analytical Evaluation); or to a causal explanation of the 
(valued or disvalued) performance of the evaluand, which
is not (usually) a concern of the evaluator. See Explanation.

MICRO-EVALUATION See Analytical Evaluation.

MINIMAX The decision strategy of acting so as to
minimize the maximum possible loss, e.g. buying fire insurance
simply because it eliminates the worst case of a total
uninsured loss. Contrasted with maximax (maximizing
maximum gain), e.g. entering the lottery with the largest prize,
regardless of ticket price or number of tickets sold. These are
said to be significant alternatives to optimization, which
maximizes expectancy (the product of the utility of an
outcome by the probability that it will occur). In fact, insofar as
they can be justified at all, they are simply limit cases,
applicable in limited special cases. Empirically, they are
alternative strategies and people use them in all sorts of cases;
but normatively that is mainly a sign of irrationality or
ignorance or lack of training.
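
The three strategies can be compared on a toy version of the fire insurance case. Everything numerical here is an assumption made for illustration: the premium, the size of the loss, the probability of fire, and the use of dollars as a stand-in for utility. The action and function names are likewise hypothetical.

```python
# Hypothetical payoffs (in dollars) for each action in each state of the world.
payoffs = {
    "insure":    {"fire": -500, "no_fire": -500},     # premium paid, loss covered
    "no_policy": {"fire": -200_000, "no_fire": 0},    # uninsured
}
probabilities = {"fire": 0.001, "no_fire": 0.999}     # assumed for illustration


def minimax(payoffs):
    """Pick the action whose worst outcome is least bad."""
    return max(payoffs, key=lambda a: min(payoffs[a].values()))


def maximax(payoffs):
    """Pick the action whose best outcome is best."""
    return max(payoffs, key=lambda a: max(payoffs[a].values()))


def max_expectancy(payoffs, probabilities):
    """Pick the action with the highest probability-weighted payoff."""
    def expectancy(action):
        return sum(probabilities[s] * v for s, v in payoffs[action].items())
    return max(payoffs, key=expectancy)


print(minimax(payoffs), maximax(payoffs), max_expectancy(payoffs, probabilities))
```

On these figures minimax buys the policy, while maximax and the expectancy calculation both decline it. Replacing dollars with a utility scale on which a total loss is ruinous could bring the expectancy answer into line with minimax, which is one way of reading the claim that minimax is a limit case rather than a rival principle.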


MINIMUM COMPETENCY TESTING A basic level 
of (usually) basic skills is a minimum competency. Success
in such tests has been tied to graduation, grade promotion,
remedial education; failure has been tied to teacher
evaluation, program non-funding etc. With all this at stake, MCT



has been a very hot political issue— and an ethical one, and 
a measurement one. Introduced with due warning and 
support it can represent a step towards honest schooling;
done carelessly, it is a disaster. See Cutting Score. 

MISSION BUDGETING A generalization of the notion 
of program budgeting (see PPBS); the idea is to develop a 
system of budgeting which will answer questions of the 
type, "How much are we spending on such and such a
mission?" (by contrast with program, agency, and personnel,
the previous kinds of categories to which budget amounts
were tied). One limitation of PPBS has been that a good
many programs overlap in the clientele they serve and the
services they deliver, so that we may have a very poor idea
of how much we're putting into e.g. welfare or bilingual
education by merely looking at agency budgets or even
PPBS figures, unless we have an extremely clear picture
(which decision makers rarely can have, especially a new
Executive Cabinet) of the actual impacted populations and
the level of service delivery from each of the programs. This
concept, along with zero-based budgeting, was popular
with the early Carter administration but we hear little about
it later in that regime, just as McNamara's introduction of
PPBS (into DOD, from Ford Motor Company) under an 
earlier administration has faded considerably. 


MODE (Stat.) (Cf. Mean, Median) The mode is the
"most popular" (most frequent) score or score interval. It's
more likely that a student about whom you know nothing
except their membership in this group scored the "modal"
score of the group than any other score. But it may not be
very likely, e.g. if every student gets a different score, except
two who get 100 out of 100, then the mode is 100, but it's not
very "typical." In a "normal" curve, on the other hand, like
the (alleged) distribution of IQ scores in the U.S.
population, the mean, the median, and the mode are all the same
value, corresponding to the highest point of the curve. Some
distributions, or curves representing them, are described as
bi-modal, etc., which means that there are two (or more)
peaks or modes; this is a looser sense of the term mode, but
useful.
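
The case of the untypical mode can be reproduced directly. The twelve scores below are invented to match the description (all different except two scores of 100); the second list is an equally hypothetical two-peaked distribution.

```python
from collections import Counter
from statistics import mean, median, multimode

# Every student gets a different score, except two who get 100 out of 100.
scores = [41, 48, 52, 57, 63, 66, 70, 74, 79, 85, 100, 100]

counts = Counter(scores)
mode_value, mode_count = counts.most_common(1)[0]
share = mode_count / len(scores)   # chance that a random student has the modal score

print(mode_value, share, mean(scores), median(scores))

# A distribution with two equally frequent peaks is bimodal.
print(multimode([1, 2, 2, 2, 3, 4, 5, 5, 5, 6]))
```

The mode is 100, yet only two of the twelve students scored it, and it sits well above both the mean and the median of the group.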

MODELS (of evaluation) A term loosely used to refer
to a conception or approach or sometimes even a method
(naturalistic, goal-free) of doing evaluation. Models are to
paradigms as hypotheses are to theories, which means less
general and some overlaps. Referenced here are the
following, frequently referred to as models: advocate-adversary,
black box, connoisseurship, CIPP, discrepancy,
engineering, judicial, medical, responsive, transactional and social
science. The best classification of these and others (many
have been attempted) is Stufflebeam and Webster's
(forthcoming, 1982).

MODUS OPERANDI METHOD A procedure for 
identifying the cause of a certain effect by detailed analysis
of the chain of events preceding it and of the ambient
conditions: it is sometimes feasible when a control group is
impossible, and it is useful as a check or strengthening of
the design even when a control group is possible. The
concept refers to the characteristic pattern of links in the
causal chain which the criminalist refers to as the modus
operandi of a criminal. These can be quantified and even
configurally scored; the problem of identifying the cause
can thus be converted into a pattern-recognition task for a
computer. The strength of the approach is that it can be
applied in individual cases, informally, semi-formally (as in
criminalistics), and formally (full computerization). It also
leads to MOM-oriented designs which deliberately employ
"tracers," i.e. artefactual features of a treatment which will
show up in the effects. An example would be the use of a
particular sequence of items in a student questionnaire
disseminated to faculty for instructional development use.
(Details in a section by this title in Evaluation in Education,
ed. J.R. Popham, McCutcheon, 1976.)

MONITOR The term "monitor" was the original term
for what is now often called by an agency "the project
officer," namely the person from the agency staff that is
responsible for supervising progress and compliance on a
particular contract or grant. "Monitor" was a much clearer
term, since "project officer" could equally well refer to
somebody whose responsibilities were to the project
manager, or to somebody who merely handled the contract
paper work (the "contract officer," as the fiscal agent at the
agency is sometimes called). But it was apparently thought
to have "Big Brother" connotations, or not to reflect
adequately the full range of responsibilities, etc. See
Monitoring.

MONITORING A monitor (of a project) is usually a 
representative of the funding agency who watches for
proper use of funds, observes progress, provides
information to the agency about the project and vice versa. Monitors
badly need and rarely have evaluation skills; if they were all
even semi-competent formative evaluators, their (at least
quasi-) externality could make them extremely valuable
since many projects either lack evaluation staff, or have
none worth having, or never supplement them with
external evaluation. Monitors have a schizophrenic role which
few learn to handle; they have to represent and defend the
agency to the project and represent and defend the project
to the agency. Can these roles be further complicated by an
attempt at evaluation? They already include it and the only
question is whether it should be done reasonably well. 

MOTIVATION The disposition of an organism or
institution to expend effort in a particular direction. It is best
measured by a study of behavior, since self-reports are
intrinsically and contextually likely to be unreliable. Cf.
Affect.

MOTIVATIONAL EVALUATION The deliberate use 
of evaluation as a management tool to alter motivation can 
be content-dependent or content-independent. If the
evaluation recommends a tie between raises and
work-output which is adopted it may affect motivation; if it cuts
the (supposed or actual) connection, it will be likely to have
the opposite effect on motivation. But the mere
announcement of an evaluation even without its occurrence, and
certainly the presence of an evaluator, can have very large
(good or bad) effects on motivation, as experienced
managers well know. Evaluators, on the other hand, are prone to
suppose that the contents of their reports are what counts,
and tend to forget the reactive effects, while they would be
the first to suspect the Hawthorne effect in a study done by
someone else.

Establishing a sustained level of self-critical awareness
(the evaluative attitude) requires a sustained effort by a
manager or team leader. That effort might comprise
arrangements for regular external evaluation, or quality
circles, or simply role-modeling self-evaluation. When people
say that "in Japan projects are not evaluated for the first ten
years" they show a complete misunderstanding of
evaluation; it is not restricted to "major external summative
review," and Japan is the place where internal evaluation
(e.g. by quality circles) has become so well-accepted that
one can make sense of a long trial period before the stop and
go review. It would be idiotic to do that in the absence of a
strong evaluative commitment and competence in the work
group. There is no worthwhile commitment to quality
without competent and frequent self-evaluation.

MULTIPLE-TIER See Two-Tier. 

"MY TRIBE" SYNDROME See Going Native.

NATURALISTIC (evaluation or methodology) An
approach which minimizes much of the paraphernalia of
science, e.g. technical jargon, prior technical knowledge,
statistical inference, the effort to formulate general laws, the
separation of the observer from the subject, the
commitment to a single correct perspective, theoretical structures,
causes, predictions and propositional knowledge. Instead
there is a focus on the use of metaphor, analogy, informal
(but valid) inference, vividness of description,
reasons-explanations, interactiveness, meanings, multiple
(legitimate) perspectives, tacit knowledge. (For an excellent
discussion, see Appendix B, "Naturalistic Evaluation," in
Evaluating with Validity, E. House, Sage, 1980.) The Indiana
University group (Guba and Wolf particularly) have paid
particular attention to the naturalistic model, and their
definition (Wolf, personal communication) stresses that it
(a) has more orientation towards "current and spontaneous
activities, behaviors and expressions rather than to some
statement of prestated formal objectives; (b) responds to
educators, administrators, learners and the public's interest
in different kinds of information, and (c) accounts for the
different values and perspectives that exist"; it also
stresses contextual factors; unstructured interviewing;
observation rather than testing; meanings rather than mere
behaviors. Much of the debate about the legitimacy/utility
of the naturalistic approach recapitulates the idiographic/
nomothetic debate in the methodology of psychology and




the debates in the analytical philosophy of history over the 
role of laws. At this stage the principal exponents of the 
naturalistic approach (e.g. Stake) may have gone too far in
the laissez-faire direction (any interpretation the audience
makes is allowable); but their example has shown up the
impropriety of many of the formalists' assumptions about
the applicability of the social science model. And a new
model of excellence is clearly emerging, in Stake's work and
in Guba and Lincoln's Effective Evaluation (Jossey-Bass,
1981). 

NEEDS ASSESSMENT (NEEDS SENSING is a related 
recent variant) This term has drifted from its literal
meaning to a jargon status in which it refers to any study of the
needs, wants, market preferences, values or ideals that
might be relevant to e.g. a program. This sloppy sense
might be called the "direction-finding" sense (or process),
and it is in fact perfectly legitimate when one is looking for
all possible guidance in planning, or justification for
continuance of a program. Needs assessment in the literal sense
is just part of this and it is the most important part; hence, even
if the direction-finding approach is taken, one must then 
sort out the true needs. Needs provide the first priority for 
response just because they are in some sense necessary 
whereas wants (merely) are desired and ideals are "ideal- 
istic/' i.e., often impractical. It is therefore very misleading 
to produce something as a NA (needs assessment) when in 
fact it is just a market survey, because it suggests that there 
is a level of urgency or importance about its findings which 
simply isn't there. True needs are considerably harder to 
establish than felt wants, because true needs are often 
unknown to those who have them — possibly even contrary 
to what they want, as in the case of a boy who needs a
certain diet and wants an entirely different one. 

The most widely used definition of need — the "discrep- 
ancy definition" — does not confuse needs with wants but 
does confuse them with ideals. It defines need as the gap
between the actual and the ideal, or whatever it takes to
bridge it. This definition has even been built into law in 
some states. But the gap between your actual income and 
your ideal income is quite different (and much larger) than 
the gap between your actual income and what you really 




need. So we have to drop the use of the ideal level as the key 
reference level in the definition of need, which is just as
well, because it is very difficult to get much agreement on
what the ideal curriculum is like and if we had to do that
before we could argue for any curriculum needs, it would be
hard to get started. 

A second fatal flaw in the discrepancy definition is its 
fallacious identification of needs with one particular subset 
of needs, namely unmet needs. But there are many things 
we absolutely need — like oxygen in the air, or vitamins in 
our diet — which are already there. To say we need them is 
to say they are necessary for e.g. life or health, which
distinguishes them from the many inessential things in the
environment. Of course, on the discrepancy definition they are
not needs at all, because they are part of "the actual," not 
part of the gap (discrepancy) between that and the ideal. It 
may be useful to use the dietary terminology for met and 
unmet needs — maintenance and incremental needs. People 
sometimes think that it's better to focus on incremental 
needs because that's where the action is required; so maybe
— they think — the discrepancy definition doesn't get us 
into too much trouble. But where will you get the resources 
for the necessary action? Some of them usually come from 
redistribution of existing resources, i.e., from robbing 
Peter's needs to pay for Paul's, where Peter's (the mainte- 
nance needs) are just as vital as Paul's (the incremental). 
This leads to an absurd flip-flop in successive years: it is 
much better to look at all needs in the NA, prioritize them 
(using apportioning methods, not grading or ranking) and
then act to redistribute old and new resources.

A better definition of need, which we might call the
diagnostic definition, defines need as anything essential for a
satisfactory mode of existence, i.e., anything without which
that mode of existence or "level of performance" would fall
below a satisfactory or acceptable level. The slippery term in
this is of course "satisfactory" and it is context-dependent;
satisfactory diets in a nation gripped by famine may be
considerably nearer the starvation level than those regarded
as satisfactory in a time of plenty. But that is part of the
essentially pragmatic component in NA; it is a prioritizing
and pragmatic concept. Needs slide along the middle range
of the spectrum from disaster to Utopia as resources become 




available. They never cover the ends of the spectrum: no
riches, however great, legitimate the claim that everyone
needs all possible luxuries.

The next major ambiguity or trap in the concept of need
relates to the distinction between what we can call perfor- 
mance needs and treatment needs. When we say that children 
need to be able to read, we are talking about a needed level 
of performance. When we say they need classes in reading, 
or instruction in the phonics approach to reading, we are
talking about a needed treatment. The gap between the two
is vast, and can only be bridged by an evaluation of the
alternative possible treatments that could yield the
allegedly needed performance. Children need to be able to
converse, but it does not follow they need classes in talking,
since they pick it up without any. Even if it can be shown
that they do need the "treatment" of reading classes, that's
a long way from the conclusion that any particular approach
to reading instruction is needed. The essential points are
that the kind of NA with which one should begin
evaluations is performance NA; and that treatment needs claims
essentially require both a performance NA and a full-scale
evaluation of the relative merits of the best candidates in the 
treatment stakes. 

Conceptual problems not discussed here include the 
problem of whether there are needs for what isn't feasible,
and the distinction between artificial needs (alcohol) and 
essential needs (food); methodological problems including 
the flaws in the usual procedures for performing NA are 
discussed elsewhere (IE). 

The crucial perspective to retain on NA is that it is a
process for discovering facts about organisms or systems;
it's not an opinion survey or a wishing trip. It is a fact about
children, in this environment, that they need Vitamin C and
functional literacy skills, whether or not they think so or
their parents think so or for that matter witchdoctors or
nutritionists or reading specialists think so. What makes it a
fact is that the withdrawal of, or failure to provide, these
things results in very bad consequences, by any reasonable
standards of good or bad. Thus, models for NA must be
models for truth-finding, not for achieving political agreement.
That they are all too often of the latter kind reflects the
tendency of those who design them to think that value
judgments are not part of the domain of truth. For NAs are
value judgments just as surely as they are matters of fact;
indeed, they are the key value judgments in evaluation, the
root source of the value that eventually makes the conclusion
an evaluative one rather than a purely descriptive one.
It's easy to see this if we began with a statement that referred
to an ideal, as we (implicitly) do with the discrepancy
definition; or if we are looking at a treatment-need statement
(since that is an evaluation). And it's easy to see that if
we began with mere market surveys, we would not have an
evaluative conclusion, just a descriptive one (possibly describing
a population's evaluation, but not making evaluations).
But diagnostic-definition performance NAs are also
evaluative because they require the identification of the
essential, the important, that which avoids bad results. Of
course, these are often relatively uncontroversial value
judgments. Evaluations build on NAs like theories build on
observations; it's not that observations are infallible, only
that they're much less fallible than theoretical speculation.

NEUTRAL Not supporting any of the warring factions,
i.e. a political category. No more likely to be right than they
are; more likely to be ignorant; not always more likely to be
objective. Hence, to be used with care, not as the always-ideal
choice for judges, juries, evaluators, etc. Non-evaluative
language is also said to be neutral; it is no more objective
than evaluative language, e.g. "We have just received a visit
from extra-terrestrial beings" may be far more tendentious
than "He's a murderer" or vice versa.

NORMAL DISTRIBUTION (Stat.) Not the way
things are normally distributed, though some are, but an
ideal distribution which results in the familiar bell-shaped
curve (which, for example, is perfectly symmetrical though
few real distributions are). A large part of inferential statistics
rests on the assumption that the population from which
we are sampling is normally distributed, with regard to the
variables of interest, and is invalid if this assumption is
grossly violated, as it quite often is. Height and eye color are
often given as examples of variables that are normally distributed
but neither is a well-supported example. (The
term "Gaussian distribution" is sometimes and much less
confusingly used for this distribution.)

NORMATIVE A technical term in the social sciences
that has essentially come to mean "evaluative." It came to
have that meaning (one surmises) because the only factual
basis that the so-called "empiricists" could see for evaluative
language was as a way to express a judgment of deviance
from a norm (e.g. of acceptability or merit). Hence to
say a performance was "excellent" was to say it was a
standard deviation or two above the norm (or average or
mean). This superficial analysis led to such practices as
"grading on the curve," i.e. the confusion of "better" with
"good." It is superficial because, amongst other reasons,
it fails to account for the meaning of "above" in "above the
norm," which is of course an evaluative term but not reducible
to statistics; or, if so reducible, is only reducible in a
much more sophisticated way, which would then show
the original reduction to be superficial. The language of
ranking is norm-related, the language of grading is not
(entirely), and the use of the term "normative" confounds
the two.

The usual contrast of "normative" with "descriptive"
would make no sense on this analysis, since "normative"
has carefully been given a descriptive meaning. In fact,
"normative" does contrast with descriptive, in the usual
social scientists' usage, because they ignore the analysis and
use "normative" to mean "evaluative." But then, of course,
the term "normative" is entirely redundant, a monument to
a dead analysis and a love of jargon. Like the equally confused
use of the term "value judgment" to mean "expression
of preference," the term "normative" sacrifices a useful
term to a false god. "Normative" should simply mean "directly
referring to norms, whether they are descriptive or
evaluative" and thus should cover "unusually tall," "atypical,"
"different" etc., as well as ranking language (which is
evaluative) but should not refer to "worthless," "perfect."
But do not such terms refer to standards, i.e. norms of
value, and hence qualify as normative? So they do, in a
sense. But it's an irrelevant sense. "His eyes are blue"
refers, in that sense, to standards of color; "He's 6'3""
refers to standards of length. But those are the paradigms of
descriptive language, so that sense of "normative" destroys
the distinction. "Prescriptive" is a somewhat better term to
use, as a contrast with "descriptive."



NORM-REFERENCED TESTS These are typically constructed
so as to yield a measure of relative performance of
the individual (or group) by comparison with the performance
of other individuals (or groups) taking the same test,
e.g. in terms of percentile ranking (cf. Criterion-Referenced
Tests). That means throwing out items which
(nearly) all testees get right (or wrong) because those items
do not "spread" the testee population, i.e. help in ranking.
What's left may or may not give a very reliable indication of
e.g. reading ability as such (by contrast with better reading
ability). Since the simplest and often the best quick way to
determine whether a test involves unrealistic standards is
by finding out how many students in the state succeed at
that level, norm-referencing is a valuable part of any testing
program. It is not ideal as a sole basis since it makes discriminating
or competing more important than (or the only
meaning of) achieving, and severely weakens the test as an
indicator of mastery (or excellence or weakness), which you
should also know about. The best compromise is a criterion-referenced
test on which the norms are also provided,
where the criteria are independently documented needs.
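
The item-screening step described in this entry can be sketched in Python. This is only an illustration, not a standard procedure: the function names, the toy response matrix and the 0.1/0.9 cut-offs are all assumptions introduced here.

```python
from bisect import bisect_left


def item_facility(responses):
    """Proportion of testees answering each item correctly.

    `responses` holds one row per testee, each a list of 0/1 item scores.
    """
    n = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n for i in range(n_items)]


def spreading_items(responses, low=0.1, high=0.9):
    """Indices of items that help to rank the testees.

    Items that (nearly) everyone gets wrong or right are dropped.
    The cut-offs are hypothetical, chosen only for illustration.
    """
    return [i for i, p in enumerate(item_facility(responses)) if low < p < high]


def percentile_rank(score, all_scores):
    """Percentage of the group scoring strictly below `score`."""
    ordered = sorted(all_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


# Five testees, four items; item 0 is answered correctly by everyone.
responses = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 1, 0],
]
kept = spreading_items(responses)
totals = [sum(row[i] for i in kept) for row in responses]
print(kept, totals, percentile_rank(3, totals))
```

Note what the screening discards: item 0, which every testee has mastered, contributes nothing to the ranking and is removed, although it is exactly the kind of item a mastery (criterion-referenced) test would want to keep.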

NULL HYPOTHESIS The hypothesis that results are
due to chance. Statistics only tells us about the null hypothesis;
it is experimental design that provides the basis
for inferences to the truth of the scientific hypothesis of
interest. The "significance levels" referred to in experimental
design and interpretation are the chances of getting
results at least this extreme if the null hypothesis is correct.
Hence, when results "reach the .01 level of significance"
that means there's only one chance in a hundred that chance
alone would produce them. It does not
mean that there's a 99 percent chance that our hypothesis is
correct, because, of course, there may be other explanations
of the result that we haven't thought of. See Hypothesis
Testing.
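
One way to make "due to chance" concrete is an exact permutation test, sketched below in Python. The technique is swapped in here for illustration and is not one the entry names; the function name and the two tiny score lists are made up. The function counts how often a random relabelling of the pooled scores gives a difference in means at least as large as the one observed.

```python
from itertools import combinations


def exact_permutation_p(treated, control):
    """One-sided exact permutation p-value for a difference in means.

    Enumerates every way of relabelling the pooled scores and returns the
    fraction of relabellings whose mean difference is at least as large
    as the observed one.
    """
    pooled = list(treated) + list(control)
    k = len(treated)
    rest = len(pooled) - k
    total = sum(pooled)
    observed = sum(treated) / k - sum(control) / len(control)
    as_extreme = 0
    splits = 0
    for idx in combinations(range(len(pooled)), k):
        t_sum = sum(pooled[i] for i in idx)
        diff = t_sum / k - (total - t_sum) / rest
        splits += 1
        if diff >= observed - 1e-12:  # tolerance for float rounding
            as_extreme += 1
    return as_extreme / splits


# Hypothetical scores, for illustration only.
treated = [12, 14, 15, 16]
control = [8, 9, 10, 11]
p = exact_permutation_p(treated, control)
print(f"p = {p:.4f}")  # 1 of the 70 possible splits is this extreme
```

A small p here says only that chance relabelling rarely produces such a gap; whether the treatment is the explanation still depends on the design, as the entry stresses.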

NUT ("making the nut") Management consultant jargon
for the basic cost of running the business for the year.
After "making the nut" one may become a little choosier
about which jobs to take on, and what rates to set.

OBJECTIVE Unbiased, which does not exclude evaluative
or controversial. Cf. Neutral, Bias, Subjectivity.

OBJECTIVES The technical sense of this term refers to
a rather specific description of an intended outcome; the
more general description is referred to as a goal.

OBSERVATION The process or product of direct sensory
inspection, frequently involving trained observers.
The line between observation and its normal antonym
"interpretation" is not sharp and is in any case context-dependent,
i.e. what counts as an observation in one context
("a very pretty dive") will count as an interpretation in
another (where the diving judges' score is appealed). Just as
it is very difficult to get trainees in evaluation, even those
with considerable scientific training, to write non-evaluative
descriptions of something that is to be evaluated, so it is
difficult to get observers to see only what's there rather than
their inferences from it. The use of checklists and training
can produce very great increases in reliability and validity in
observers; observation is thus a rather sophisticated process,
and not to be equated with the amateur's perceptions or
reports on them. It should be clear from the above that there
are contexts in which observers, especially trained observers,
can correctly report their observations in evaluative
terms. (An obvious example, where no special training is
involved, is reporting scores at a rifle range.)

OPPORTUNITY COSTS Opportunity costs are what
one gives up by engaging in a particular activity. The same
concept applies to investing money or any other resource.
There are always opportunity costs; one at least has to give
up leisure to do something, or give up work to do nothing,
i.e., enjoy leisure. Calculating them (like profit) is a conceptual
task first, and an arithmetic one later. In the first
place, there is always an infinity of alternatives to any action,
all of which one gives up. Does it follow that opportunity
costs are always infinite? The convention is that the OC is
the value of the most valuable of these. So, calculating one
OC often involves calculating a great many costs of alternatives.
See Cost.
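
The convention stated in this entry (the OC is the value of the most valuable alternative given up) is easy to express in code. The sketch below is illustrative only; the function name, the activities and their values are hypothetical, and it assumes the conceptual work of listing and valuing the alternatives has already been done.

```python
def opportunity_cost(chosen, values):
    """Return the best forgone alternative and its value.

    `values` maps each available activity to its estimated value;
    `chosen` is the activity actually undertaken.
    """
    forgone = {name: v for name, v in values.items() if name != chosen}
    if not forgone:
        raise ValueError("at least one alternative is needed")
    best = max(forgone, key=forgone.get)
    return best, forgone[best]


# Hypothetical valuations, in arbitrary units.
values = {"evaluate program A": 40, "write proposal": 55, "leisure": 25}
print(opportunity_cost("evaluate program A", values))  # ('write proposal', 55)
print(opportunity_cost("write proposal", values))      # ('evaluate program A', 40)
```

The arithmetic is a single `max`; all the difficulty lies in filling in `values`, which is why one OC often requires costing many alternatives.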

OPTIMIZATION The decision strategy according to
which one should select the alternative with the maximum
expectancy (i.e. product of the probability of that payoff by
its utility, should it eventuate). It is evaluatively always the
correct strategy if the analysis is done correctly, e.g. by
introducing the utility of risk or gambling, of anxieties
raised and time spent on calculating strategies, and of information
that could be obtained by exploration or inquiry and
could bear on the best mainstream strategy. (Of course,
search has its own costs, as does costing search. Only computers
could handle the full spread of calculations, especially
when there is a high "need for speed" e.g. in weapon
systems control.) Descriptively, people often operate on
alternative strategies e.g. satisficing, minimax, maximax;
these are approximations to optimization in certain limiting
cases. And, of course, they use their estimates of the utility
of the alternatives, not necessarily the true utility of those
alternatives to them. Additionally, some of them (or us)
simply choose without thought or based on irrational
thought (some or much of the time). Evaluation should be
mainly concerned with identifying correct choices, not predicting
actual choices (the task of the psychologist, economist,
sociologist, etc.)
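
The strategies named in this entry can be compared on a toy decision. The sketch below is an illustration under stated assumptions: each alternative is a list of (probability, utility) pairs, expectancy is taken as the sum of probability times utility over an alternative's payoffs, and all names, numbers and the satisficing threshold are hypothetical.

```python
def expectancy(outcomes):
    """Sum of probability * utility over an alternative's possible payoffs."""
    return sum(p * u for p, u in outcomes)


def optimize(alternatives):
    """Pick the alternative with the maximum expectancy."""
    return max(alternatives, key=lambda name: expectancy(alternatives[name]))


def maximin(alternatives):
    """Pick the best worst case (the utility-side form of 'minimax' loss)."""
    return max(alternatives, key=lambda n: min(u for _, u in alternatives[n]))


def maximax(alternatives):
    """Pick the best best case, ignoring probabilities."""
    return max(alternatives, key=lambda n: max(u for _, u in alternatives[n]))


def satisfice(alternatives, threshold):
    """Pick the first alternative examined that is 'good enough'."""
    for name, outcomes in alternatives.items():
        if expectancy(outcomes) >= threshold:
            return name
    return None


alternatives = {
    "safe": [(1.0, 50)],
    "steady": [(0.5, 40), (0.5, 80)],
    "gamble": [(0.9, 0), (0.1, 400)],
}
print(optimize(alternatives))       # steady (expectancy 60)
print(maximin(alternatives))        # safe   (worst case 50)
print(maximax(alternatives))        # gamble (best case 400)
print(satisfice(alternatives, 45))  # safe   (first to reach 45)
```

The four rules disagree on this example, which illustrates the entry's point that the shortcut strategies match optimization only in limiting cases.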

OUTCOME EVALUATION Outcomes are usually the
post-treatment effects; there are often effects during treatment,
e.g. enjoyment of a teaching style, which we sometimes
(casually) call process. Quite different outcomes may
occur to different groups of impactees or "true consumers,"
just as costs differ from group to group. (Costs are not
sharply distinct from outcomes, e.g. staff exhaustion is a cost
and an outcome of a demanding program (a recoil effect).)
Outcomes may be factual and evaluative (reduction of murder
rate) or factual and not evaluative (reduction of morbidity)
in which case they have to be coupled to needs assessment
results to get into an evaluative conclusion. See Payoff
Evaluation.

OVERLEARNING Overlearning is learning past the
point of 100% recall, and is aimed at generating long-term
retention. In order to avoid boredom on the part of the
learner, and for other reasons, the best way to do this is
through reintroducing the concept (etc.) in a variety of
different contexts. One reason that long-term studies, or
the follow-up phase of an evaluation, often reveal grave
deterioration of learning is that people have forgotten the
distinction between learning to criterion at t1 and learning to
criterion at t2; in fact, the latter is the correct criterion, where
t2 is the time when the knowledge is needed, while t1 is the
end of the instructional period.

PAD, PADDING When a bidder goes up with a budget
for a proposal, there has to be, one way or another,
some allowance in it for unforeseen eventualities, at least
if it is to be done according to sound business practices. This
is often referred to as "the pad," and the practice of doing
this is the legitimate version of "padding the budget." Padding
the budget is also used as a term to refer to illegitimate
additions to the budget (excessive profits); but it must be
realized that the pad is the only recourse that the contractor
has for handling the obvious unreliabilities in predicting the
ease of implementing some complicated testing program,
the ease of designing a questionnaire that will get past the
questio 

etc. See &qsU4u& * 



PARADIGM An extremely general conception of a discipline,
which may be very influential in shaping it, e.g.
"the classical social science paradigm in evaluation."

PARALLEL DESIGNS (in evaluation) Those in which
two or more evaluation teams or evaluators proceed independently
(not necessarily concurrently). They are of great
importance because of the light they shed on the unknown
extent of interevaluator agreement; because such a process
increases the care with which each team proceeds; and
because the reconciliation process (synthesis) leads to a
deeper analysis than is achieved by the evaluators independently.
For these reasons, it is usually better to spend a
given evaluation budget on two smaller teams getting half
as much than on one well-funded team. But managers
dislike this idea just because the teams may disagree (their
real virtue!).

PARALLEL FORMS Versions of a test that have been 
tested for equal difficulty and validity. 

PARALLEL PANELS In proposal review, for example,
it is important to run independent concurrent panels in
order to get some idea of the reliability of the ratings they are
producing. On the few occasions this has been done, the
results have been extremely disquieting, since unreliability
guarantees both invalidity and injustice. One would expect
a federal science foundation to have enough commitment to
validity and justice to do routine checks of this kind, but
they usually cry poormouth instead of looking for ways to
get validity within the same budget. In any case, dispensing
funds invalidly and unfairly is not justified by saying it
would cost slightly more to do it reasonably well, even if
true, since the payoffs would be higher (from the definition
of "doing it" and "reasonably well"), and justice is supposed
to be worth a little.

PARETO OPTIMAL A tough criterion for changes in
e.g. an organization or program which requires that
changes be made only if nobody suffers and somebody
benefits. Crucial feature is that it appears to avoid the problem
of justifying so-called "interpersonal comparisons of
utility," i.e., showing that the losses some sustain as a result
of a change are less important than the gains made by
others. Improving welfare conditions by raising taxes is not
Pareto optimal, obviously. But selecting between alternative
Pareto optimal changes still involves relative hardship
and benefit considerations. A major weakness in Rawls'
theory of justice is the commitment to Pareto optimality.
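
The criterion in this entry ("nobody suffers and somebody benefits") can be written as a short check. The sketch below is illustrative only: the function name, the two groups and their welfare numbers are hypothetical, and it assumes each party's welfare can be put on its own ordinal scale.

```python
def is_pareto_improvement(before, after):
    """True if no party is worse off and at least one is better off.

    `before` and `after` map each party to a welfare level; only
    within-party comparisons are made, never comparisons across parties.
    """
    if before.keys() != after.keys():
        raise ValueError("both states must cover the same parties")
    nobody_suffers = all(after[k] >= before[k] for k in before)
    somebody_benefits = any(after[k] > before[k] for k in before)
    return nobody_suffers and somebody_benefits


# Hypothetical welfare levels for two groups.
status_quo = {"taxpayers": 10, "recipients": 4}
tax_funded = {"taxpayers": 9, "recipients": 7}   # recipients gain, taxpayers lose
efficiency = {"taxpayers": 10, "recipients": 5}  # a gain at nobody's expense

print(is_pareto_improvement(status_quo, tax_funded))  # False
print(is_pareto_improvement(status_quo, efficiency))  # True
```

The check never weighs one party's loss against another's gain, which is the feature the entry describes; it also rejects the tax-funded change however large the recipients' gain.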

PARETO PRINCIPLE A management maxim possibly
more illuminating than the Peter Principle and Parkinson's
Law; it is sometimes described as the 80/20 rule, or the
"principle of the vital few and the trivial many," and asserts
that about 80% of significant achievement (e.g. at a meeting)
is done by about 20% of those present; 80% of the sales
come from 20% of the salespeople, 80% of the pay-off from
a task-list can be achieved from 20% of the tasks, etc. Worth
remembering because it's sometimes true, and often surprising.

PARKINSON'S LAW "Work (and budgets, timelines
and staff size) expands to fill the space, time and funds
available." If the converse were true it would mean we
could do everything by allowing no time for it; but as it
stands it is a considerable insight about large organizations.
The fact that bids on RFPs come in close to the estimated
limit may not illustrate this, only that the work could be
done at various levels of thoroughness or that RFP writers
aren't dumb.

PATH ANALYSIS A procedure for analyzing sets of
mathematical relationships which can shed some light on
the relative importance of the variables. It may even place
some constraints on their causal relationship. It cannot definitely
identify any one or group of them as a cause.

PAYBACK (PERIOD) A term from fiscal evaluation
which refers to the time before the initial cost is recovered;
the recovery cash flows should of course be time-discounted.
Payback analysis is what shows that buying a
$12,000 word-processor may be sensible even if the price
will probably drop to $8000 in a year; if the payback period
is, say, 15 months (typical of many carefully-chosen installations),
you will in fact lose several thousand dollars by
waiting for the price to drop.
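
The word-processor example can be worked through in a few lines. The sketch below makes simplifying assumptions that are not in the entry: the savings arrive evenly each month, the cash flows are left undiscounted, and a machine bought a year later would save at the same monthly rate. The function names are hypothetical.

```python
def monthly_saving(price, payback_months):
    """Undiscounted monthly return implied by a payback period."""
    return price / payback_months


def net_loss_from_waiting(price_now, price_later, payback_months, wait_months):
    """Savings forgone while waiting, minus the price drop gained."""
    forgone = monthly_saving(price_now, payback_months) * wait_months
    return forgone - (price_now - price_later)


loss = net_loss_from_waiting(12_000, 8_000, payback_months=15, wait_months=12)
print(f"Net loss from waiting a year: ${loss:,.0f}")  # $5,600
```

A 15-month payback on $12,000 implies $800 a month, so a year's wait forgoes $9,600 of savings to gain a $4,000 price drop. Time-discounting the flows would change the figure somewhat.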

PAYOFF EVALUATION Evaluation focused on re- 
sults; the method of choice apart from costs, delay, and 
intervening loss of control or responsibility. (See Process
Evaluation.) Essentially similar to outcome evaluation. 

PEER REVIEW Evaluation, usually of proposals or college
faculty, done by a panel of judges with qualifications
approximating those of the author or candidate. The traditional
approach but extremely shaky. Matched panels produce
different results, fatigue, learning and halo effects are
widespread, etc. The process can be greatly improved, but
there's little interest in doing so, possibly because it's primarily
serving as a legitimating or symbolic kind of evaluation,
not a truth-seeking one. Possibly the reluctance is just
due to ignorance of the social cost of errors, plus nervousness
about the time-costs for panelists. See Calibration,
Two-Tier.

PERCENTILE (Stat.) If you arrange a large group in
the order of their scores on a test, and divide them into 100
equal-sized groups, beginning with those who have the
lowest score, the first such group is said to consist of those
in the 1st percentile (i.e., they have scores worse than 99%
of the group), and so on to the top group which should be
called the 100th percentile: for boring technical reasons the
actual procedure used only distinguishes 99 groups, so the
best one can do is get into the 99th percentile. With smaller
numbers or for cruder estimates, the total group is divided
into ten deciles; similarly for four quartiles, etc.
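
The grouping described in this entry can be sketched as follows. This is one simple convention among several in use, not the procedure the entry alludes to: a testee's group is found from the share of the group scoring strictly below, and the percentile is capped at 99 to mirror the remark that only 99 groups are distinguished. The function names and the toy scores are assumptions.

```python
def quantile_group(score, scores, groups):
    """1-based group number when `scores` are split into `groups` bands.

    Based on the share of testees scoring strictly below `score`, so tied
    scores always land in the same group.
    """
    below = sum(1 for s in scores if s < score)
    return (groups * below) // len(scores) + 1


def percentile(score, scores):
    """Percentile group, capped at 99."""
    return min(quantile_group(score, scores, 100), 99)


scores = list(range(1, 201))  # 200 testees with distinct scores 1..200

print(percentile(1, scores))          # 1
print(percentile(3, scores))          # 2
print(percentile(200, scores))        # 99
print(quantile_group(200, scores, 10))  # 10 (top decile)
print(quantile_group(1, scores, 4))     # 1  (bottom quartile)
```

With 200 testees each percentile group holds two people, so the scores 1 and 2 share the 1st percentile.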

PERFECTIONISM Marks' Principle: "the price of perfection
is prohibitive." Never get letters or papers retyped
when fully legible corrections can be made by hand; there
aren't enough trees, days or dollars for that. Legal documents
and typographical works of art may be exceptions,
but the Declaration of Independence has two insertions by
the scribe so there's a precedent in a legal case. (Cited by
Bliss.)

PERFORMANCE CONTRACTING The system of hiring
and paying someone to deliver (e.g. educational) services
by results. They might be paid in terms of the number
of students times the number of grade equivalents their
scores are raised. Widely tried in the 60s, now rare. Usual
story is that it didn't work or worked only by the contractor's
staff cheating ("teaching to the test"). Actual situation
was that the best contractors did a consistently good job
but the pooled results of all contractors were not good. As
with most innovations, the educational decision makers, with
their total lack of sophistication (in evaluation), treated this
result as grounds for giving up, instead of for hiring the
better contractors, from which we might have gone on to
still better teaching methods for everyone. See regression to
the mean for an example of the need for some sophistication
in setting the terms of the contract.

PERSONNEL EVALUATION Personnel evaluation typically
involves an assessment of job-related skills, in one or
more of five ways: first, judgmental observation of job performance
by untrained but well-situated observers, e.g. co-workers;
second, judgmental observation by supposedly
skilled and certainly more experienced observers, e.g.
supervisor or personnel manager or consultant; third, direct
measurement of job performance parameters, on the job, by
calibrated instruments (human or, usually, other); fourth,
observation or measurement or evaluation of performance
on job simulations; fifth, the same on paper and pencil tests
which examine job-relevant knowledge or attitudes. These
results must be (a) synthesized; (b) related to an analysis of
the job requirements (the needs assessment). These two
skills are usually far more difficult to master than the performance
rating, and are usually underemphasized. Personnel
evaluation not only involves ethical constraints
upon the way it should be done, it must also involve an
ethical dimension on which the performance of the personnel
is scored. The importance of that will vary depending
upon the amount of authority and interpersonal contact of
the individual being evaluated. There are a number of standard
traps in personnel evaluation which invalidate most of
the common approaches. For example, the failure to provide
appropriate levels of anonymity for the raters,
consistent with relevant legislation, or a general fear of bad-mouthing
others because it involves the sin referred to in
"judge not that ye be not judged," leads to an unwillingness
to voice criticism even if deserved; this (solvable) problem
requires sustained and ingenious attention. The scales used
in personnel evaluation are rarely based upon serious job
analysis and consequently can hardly give an accurate
picture of someone's performance for evaluation purposes.
Another common mistake is to put style variables into evaluation
forms or reports, in situations where no satisfactory
evidence exists that a particular style is superior to others.
Even when style variables have been validated as indicators
of superior performance, they typically cannot be used in
personnel evaluation because the correlations between
their presence and good performance are merely statistical
and are thus as illegitimate in the evaluation of individuals as
skin color, which of course does correlate statistically with
various desirable and/or undesirable characteristics. "Guilt
by association" is as inappropriate when the association is
via a common style as when it is via a common friend, race
or religion. See Style Research. A third common fallacy is
heavy dependence upon MBO (Management by Objectives)
techniques. These are essentially limited because they
ignore good shooting at targets of opportunity, and good
fire-fighting skills, and they are too easily manipulated. See
Goal-Free Evaluation.

PERSON-YEARS See Level of Effort.

PERSPECTIVAL EVALUATION This approach to or
part of an evaluation requires the evaluator to attempt various
conceptualizations of the program or product being
evaluated. Programs and products can be seen from many
different perspectives which affect every aspect of the evaluation,
including cost analysis. Advocate-adversary is a
special case of perspectival evaluation; consumer-based or
manager-based evaluations are special perspectives. As in
architecture, multiple perspectives are required in order to
see something in full depth. Different from illuminative,
responsive etc. in its total commitment to the view that
there is an objective reality of which the perspectives are
merely views and inaccurate by themselves. The correct
strand in the naturalistic approach stresses this, the weak
strand favors the "each perspective is legitimate" approach,
which is false if the perspective is claimed to be the reality
and not just one aspect of it.

PERT, PERT CHART Stands for Program Evaluation
Review Technique, a special type of flow charting, of which
perhaps the most interesting feature is the fact that an effort
is made to project times at which various points in the
project's development will be reached (and outputs at those
points) at three levels, namely the maximum likely, the
minimum likely and the most probable (date or level). This
provides a good approach to contingency planning, in the
hands of a skilled manager. As with all these devices, they
can become a pointless exercise if not closely tied to reality,
and the tie to reality can't be read off the chart.
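
The three-level projection can be turned into a single planning figure. The sketch below uses the weighting conventionally associated with PERT, (minimum + 4 x most probable + maximum) / 6; that formula is not stated in this entry and is brought in here as an assumption, as are the function name, the milestone names and the week counts.

```python
def pert_estimate(minimum, most_probable, maximum):
    """Conventional PERT weighted mean and spread for one activity.

    expected = (min + 4 * most probable + max) / 6
    spread   = (max - min) / 6
    """
    expected = (minimum + 4 * most_probable + maximum) / 6
    spread = (maximum - minimum) / 6
    return expected, spread


# Hypothetical sequential milestones: (name, min, most probable, max) in weeks.
milestones = [
    ("design", 2, 4, 6),
    ("pilot", 3, 5, 13),
    ("report", 1, 2, 3),
]

cumulative = []
elapsed = 0.0
for name, lo, mode, hi in milestones:
    expected, spread = pert_estimate(lo, mode, hi)
    elapsed += expected
    cumulative.append(elapsed)
    print(f"{name}: expected {expected:.1f} wk (+/- {spread:.2f}), done by week {elapsed:.1f}")
```

The wide range on the pilot (3 to 13 weeks) pulls its expected time above its most probable time, which is the kind of signal a manager can use for contingency planning. Whether the three inputs are realistic is, as the entry says, not something the chart itself can show.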

PLACEBO EFFECT The effect due to the delivery context
of a treatment as opposed to the delivered content. In medicine,
the placebo is a dummy pill, given to the control group
in exactly the same way as the test drug (or more generally,
the experimental treatment) is given to the experimental
group, i.e., with the nurses, the doctors and (sometimes)
the patients in ignorance as to whether the pill is a placebo
or not. (Notice that there are two errors in this as a valid
design for identifying placebo effect, but it's a considerable
improvement over giving no placebo to the control group.)
"Bedside manner" carries the placebo effect with it and
since it is estimated that prior to the sulfa drugs, 90% of all
therapeutic results were due to the placebo effect, it's a little
unfortunate that bedside manner gets little play in medical
practice and training (and, until 1948, no research). This
was presumably because of the status need to distinguish
medicine from faith-healing. Legitimation of placebo research
was therefore greatly facilitated by the use of the
Latin name, which gave it medical respectability. Psychotherapy
has been said to be entirely placebo effect (Frank); a
design to investigate this view presents interesting challenges.
In education and other human service areas, the
placebo effect is roughly equivalent to the Hawthorne effect
which probably accounts for most successes with innovations.
This is as licit as bedside manner, as long as it is not
ascribed to the snake-oil itself. But if we're honest about it
being only a placebo, won't the placebo effect evaporate?
Not if the charismatic context is preserved; "the heart has its
reasons that Reason doesn't know." One should not think
that the Hawthorne effect (or the slightly less general placebo
effect) are the results of enthusiasm or charisma on the
part of the service provider since they occur in its absence.
But enthusiasm may precipitate Hawthorne/placebo (H/p)
effect; it's one special case. It is probably correct to regard
H/p effect as an effect of the beliefs of the treatment-recipient,
but it would then be very difficult to demonstrate
because it is certainly not an effect restricted to conscious
beliefs, and the unconscious ones are hard to establish. So
it is here defined without reference to belief.

PLANNING (evaluation in) See Preformative Evalua- 
tion. 

POINT CONSTANCY REQUIREMENT (PCR) The
requirement on numerical scoring, e.g. of tests, that a point,
however earned (i.e. on whatever item and for whatever
increment of performance on a particular item), should
reflect the same amount of merit. It is connected with the
definition of an interval scale. If the PCR is violated, additivity
fails, i.e. some total performance A will add up to
more points than performance B although it is in fact inferior.
PCR is a very severe requirement and rarely even
attempted in any serious way, hence one should normally
give a holistic grade as well as a score to tests to provide some
protection against PCR failure. The key to PCR is the rubric
in essay/simulation scoring and item-matching on multiple-choice
tests.

POINT OF ENTRY The point of entry problem is the
problem for the client of when to bring an evaluator in on a
project, and the problem for the evaluator of the point in the
time flux of decisions when s/he should start evaluating the
options (critical competitors). Project directors and program
managers often feel that bringing in an external evaluator
(and often any evaluator at all) at the very beginning of a
project is likely to produce a "chilling effect," and that the
staff should have a chance to "run with the ball" in the way
they think is most likely to be productive for at least some
time without admonitions about measurability of results,
etc. The result is often that the evaluator is brought in too
late to be able to determine base-line performance, and too
late to set up control groups, and is hence unable to determine
either gains or causation, to mention only two of
the major problems that occur in trying to do evaluations of
projects that were designed without evaluation in mind.
This is not to say that evaluators never or rarely exert a
chilling effect; they often do. Often they could have avoided
it, sometimes not. GFE is one way to avoid it but impossible
in the planning phase. It's possible on a small project to
have an evaluator in for at least one series of discussions
during the planning phase, maybe get by without one for a
while after that, bring one or more back in after things begin
to take shape, and perhaps dispense with most of them
again for a second period of "unfettered creativity." However,
there are many good evaluators that exert a constantly
supportive and helpful effect on projects, in spite of being
on board all the time. They will need external evaluation
help to avoid the bias of co-option, but on a big project
there's really no alternative to an in-house early-on-board
evaluation staff. From the evaluator's point of view, the
question is what to consider "fixed," what to consider as
pointless second-guessing in doing an evaluation. Suppose
that one is brought in very late in a project. For formative
evaluation purposes, there's really no point in second-guessing
the early decisions about the form of the project,
because they're presumably irreversible. For summative
evaluation, it will be necessary to second-guess those, and
that means that the point of entry of the summative evaluator
will be back at the moment when the project design was
being determined, a point which presumably antedates the
allocation of funds to the project. The formative evaluator,
however, should in fact not be restricted to looking at the set
of choice points that are seen by the project staff as downstream
from the point at which the evaluator is called in. For
the formative evaluator, the correct point of entry for evaluation
purposes is the last irreversible decision. Even though
the staff hasn't thought of the possibility of reversing some
earlier decisions, the formative evaluator must look into
such possibilities and the cost/value of reversals.

POLICY ANALYSIS The evaluation of policies, plans, 
proposals; a "normative" (better, "prescriptive") quasi-
science, and very closely related in its recency and metho-
dology to evaluation as a discipline.

POLITICS OF EVALUATION Depending on one's role
and the day of the week, one is likely to see politics as
dirty politics — an intrusion into scientific evaluation — or as 
part of the ambient reality which evaluators are too often too
careless about including as relevant considerations. If one 
has a favorable attitude towards politics, or uses the term 
without pejorative connotations, one will include virtually 
all program background and contextual factors in the politi-
cal dimension of program evaluation. The jaundiced view
simply defines it as the set of pressures that are not related 
to the truth or merits of the case. The politics of com-
petency-based testing as a requirement for graduating is a
good example. The situation in many states is that it has 
become "politically necessary" to institute such require-
ments, now or in the near future, although the way in which
they have been instituted virtually destroys all the reasons 
for the requirements. That is, the requirement for gradua- 
ting from the 12th grade is "basic skills" at the 7th or 8th or 
5th grade level depending on the state; no demonstration of 
other skills; not even any demonstration of application skills 
on the basics; the exams set up so that multiple retakes of 
exactly the same test, or a very few versions of it, are possible 
(hence there is no proof that the skill is present); teachers 
have access to and teach to the test; other subjects are 
completely dropped from the 11th and 12th grade curricu-
lum in order to make room for yet more repetitive teaching
of drill-level basics, etc. A strong case can be made that this
version of MCT does more harm than good, though a seri-
ous version would certainly contribute towards truth-in-
packaging of the diploma. This is politics without pay-
off. But on many occasions, the "politics" is what gets
equity into personnel evaluation, and racism out of the
curriculum, though it also keeps moral education out of the
public schools, a terrible handicap for the society. Better
education about and in evaluation may be the best route to
improvement, short of a political leader with the charisma
to persuade us of anything and the brains to persuade us to
improve our self-critical skills.

POPULATION (Stat) The group of entities from 
which a sample is drawn, or about which a conclusion is 
stated. Originally meant people, obvious extension to 
things (e.g. the objects on the production line, which is the 
population which is sampled for quality control studies); 
less obvious extensions to circumstances (a field trial 
samples the population of circumstances under which a 
product might be used); still fancier extensions in statistical
theory to possible configurations, etc. 

PORTRAYAL Semi-technical term for an evaluation- 
by-(rich)-description, perhaps using pictures, quotes, pho- 
tographs, poetry, anecdotes as well as observations. See 
Responsive Evaluation, Naturalistic Evaluation. 

POSTTEST The measurement made after the "treat- 
ment," to get absolute or relative gains (depending on 
whether the comparison is with pre-test scores or compari-
son group scores).

POWER (of a test, design, analysis) An important tech- 
nical concept involved in the evaluation of experimental 
designs and methods of statistical analysis, related to effi- 
ciency. It is in tension with other desiderata such as small 
sample size, as is usual with evaluative criteria. 

POWER TEST A speed test: one where e.g. the num-
ber of items answered correctly in a given time is of impor-
tance (e.g. a typing test).

PPBS Program Planning and Budgeting System. The
management tool developed by McNamara and others at
Ford Motor Company and taken to the Pentagon when
McNamara became Secretary of Defense; since then
widely adopted in other federal and state agencies. Princi-
pal advantage and feature: identifying costs by program and
not by conventional categories such as payroll, inventory,
etc. Facilitates rational planning with regard to program
continuance, increased support, etc. Two problems: first,
it's too often (virtually always) instituted as a mere change
in bookkeeping procedures, without a program evaluation
component worth the name, so the gains in decision valid-
ity don't occur. Second, it's often very expensive to imple-
ment and unreliable in distribution of overhead, and it never
seems to occur to anyone to evaluate the problem and cost
of shifting to PPBS before doing it, a typical example of
missing the point of the whole enterprise. Cf. Meta-evalu-
ation, Mission Budgeting.

PRACTICE EFFECT The specific form of practice effect 
refers to the fact that taking a second test with the same or 
closely similar items in it, will result in improvement in 
performance even if no additional instruction or learning 
has occurred between the two tests. After all, one has done 
all the "organizing of one's thoughts" before the second
test. There is a general practice effect, which is particularly
important with respect to individuals who have not had
much recent experience with test-taking; this practice effect
simply refers to improving one's test-taking skills through
practice, e.g. one's ability to control the time spent on each
question, to understand the way in which various types of
multiple-choice questions work, etc. The more speeded the
test is, the more serious the practice effect is likely to be. The
use of control groups will enable one to estimate the size of
the practice effect, but where they're not possible, the use of
a posttest-only design for some of the experimental group 
will do very nicely instead, since the difference between the 
two sub-groups on the posttest will give an indication of the 
practice effect, which one then subtracts from the gains of 
the posttest only group in order to get a measure of the 
gains due to the treatment. 
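
The subtraction described above can be sketched in a few lines of Python. This is only one reading of the procedure: it assumes the raw gain is measured on the sub-group that took both tests (the only one with a pretest score), and the function name and all scores are hypothetical, invented for illustration.

```python
from statistics import mean

def treatment_gain(pre_scores, post_pretested, post_only):
    """Return (practice effect, gain due to treatment net of practice)."""
    # Practice effect: how much better the pretested sub-group does on the
    # posttest than the sub-group seeing the test for the first time.
    practice_effect = mean(post_pretested) - mean(post_only)
    # The raw gain is only observable for the sub-group that took both tests.
    raw_gain = mean(post_pretested) - mean(pre_scores)
    return practice_effect, raw_gain - practice_effect

# Hypothetical scores, not data from any study.
pre = [40, 50, 60]
post_with_pretest = [70, 80, 90]
post_without_pretest = [64, 74, 84]

practice, net_gain = treatment_gain(pre, post_with_pretest, post_without_pretest)
print(practice, net_gain)
```

The estimate is only as good as the assumption that the two sub-groups were comparable before treatment, which is why random assignment to the posttest-only condition is the safer choice.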

PREDICTIVE VALIDITY See Construct Validity. 

PREFORMATIVE (evaluation) Evaluation in the plan-
ning phase of a program; typically involves gathering base-
line data, improving evaluability, designing the evaluation,
improving the planned program etc. See Evaluation.

PREORDINATE EVALUATION See Responsive
Evaluation.



PRESS RELEASES The rules are: (1) Don't bother to
hand (or send) out the technical version, even as a supple-
ment. (2) Don't bother to hand out a summary of the techni-
cal document. (3) Don't bother to hand out a statement
which says favorable things and then qualifies them:
either the qualification or the favorable comment will be
dropped. (4) Issue only a basic description of the program
itself plus a single overview claim, e.g. "Results do not as
yet show any advantages or disadvantages from this ap-
proach, because it's much too early to tell. May have a
definite conclusion in n months." (That's an interim release; 
in a final release you drop the second sentence.) 

PRETEST Pretests are normally said to be of two kinds; 
diagnostic and baseline. In a diagnostic pretest, the peda- 
gogical (health etc.) function is to identify the presence or 
absence of the prerequisite skills, or the places where re- 
mediation instruction should be provided. These tests will 
typically not be like the posttests. In baseline pretesting, on
the other hand, we are trying to determine what the level of
knowledge (etc.) is on the criterion or pay-off dimensions,
and hence it should be matched exactly, for difficulty, with
the posttest. Instructors often think that using this kind of
pretest will have bad results, because students will have a
"failure experience." Properly managed, the reverse is the
case; one frequently discovers that some or all students are
not as ignorant as one had thought about the subject matter
of the course, in which case very useful changes can be
made in content, or "challenging out" can be allowed, with
a reduction in costs to the student and possibly to the
instructor. Moreover, the pretest gives an excellent and 
highly desirable preview of the kind of work that will be 
expected, and if it is— as it should be — gone over carefully 
in class, one has provided students with an operational 
definition of the required standards for passing. Further- 
more, one has created a quite useful climate for interesting 
the students in early discussions, by giving them a chance to 
try to solve the problems with their native wit, and then
explaining how the content of the course helps them do
better. In many subjects, though not all, this constitutes a
very desirable proof of the importance of the course. Of
course, treating the pretest as defining the early
course content is likely to qualify as teaching to the test if
one uses many of the items from the pretest in the posttest.
But there are times when this is entirely appropriate; and in
general it is very sensible to pull items for the posttest out of
a pool that includes the items from the pretest, so that at
least some of them will be retested. This encourages learn-
ing the material covered in the pretest, which should cer-
tainly not be excluded from the course just because it has
already been tested. Instructors who begin to give pretests
also begin to adjust their teaching in a more flexible way to
the requirements of a specific class, instead of using exactly
the same material repeatedly. Thus the use of a pretest is an
excellent example of the integration of evaluation into
teaching, and a case of evaluation procedures paying off
through side effects as well as through direct effects (which,
in this case, would be the discovery that students are not
able to learn certain types of material from the text, notes
and lectures provided on that topic). Similar points apply to
pretesting in other fields e.g. health.

PRICE The charged cost (charged by vendor to con- 
sumer or client), usually a small part of true cost, and often 
more than true worth.

PROCESS EVALUATION Usually refers to evaluation 
of the treatment (or evaluand) by looking at it instead of the 
outcome. With exceptions to be mentioned, this is only
legitimate if some connection is known (not believed) between
process variables and outcome variables, and it is never the
best approach because such connections, where they do
exist, are relatively weak, transient, and likely to be irrele-
vant to many new cases. The classic case is evaluation of
teachers by classroom observation (the universal procedure
K-12), where there are no evaluation-useable connections
between classroom behavior and learning outcomes, quite
apart from the problem that the observer's presence pro-
duces atypical teaching behavior, and the observer is nor-
mally someone with other personal relations with the
teacher that are highly conducive to bias. (The evaluation of
administrators is no better.) Certain aspects of process
should be looked at, as part of an evaluation, not as a substi-
tute for inspection of outcomes, e.g. its legality, its morality,
its enjoyability, implementation of alleged treatment, and
whether it can provide any clues about causation. It is better
to use the term mediated evaluation to refer to what is
described in the opening sentence of this entry, and allow
process evaluation to refer to that and to the direct evalua- 
tion of process variables as part of an overall evaluation 
which involves looking at outcomes. People sometimes 
wrongly suppose that process evaluation is the proper 
method when formative evaluation is being done, an ex- 
ample of the homeopathic fallacy ("use a process to evaluate 
a process"). Checking on the adequacy of the literature 
search falls under process; checking for sexist language; for 
the presence of a quality control checking system; on the 
hidden costs and causal connections; on the ethicality of 
testing procedures; on the accuracy of the warranty claims; 
on the validity of instruments used — these are some of the 
less obvious process checks. Some of them lead to changes 
in the basic description of the evaluand; some to changes in 
outcomes or costs; some finish up as e.g. ethical conclusions
about the process which do not have to pick up a value
component from the needs assessment in order to graduate 
into the final report. 

PRODUCT Interpreted very broadly, e.g. may be used 
to refer to students, etc., as the "product" of a training
program; a pedagogical process might be the product of a 
Research and Development effort. 

PRODUCT EVALUATION The best-developed kind of 
evaluation; Consumer Reports used to be the paradigm 
though it has deteriorated significantly in recent years. See 
Key Evaluation Checklist, Ref. "Product Evaluation" in 
New Techniques for Evaluation, ed. N. Smith, Sage, 1981. 

PROFESSIONALISM, PROFESSIONALLY Some- 
where above minimum competence in a profession but 
short of the realm of professional ethics there is a set of 
obligations e.g. to keeping current, and to self-evaluation, 
which should be supported and counted in personnel 
evaluation. Professional ethics for quarterbacks prohibits
kick-backs, professionalism requires kicking practice. 

PROFIT This term from fiscal evaluation has unfortu-
nate connotations to the uninformed. The gravity of the
misconception becomes clear when a non-profit organiza-
tion starts doing serious budgeting and discovers that it has
to introduce something which it can scarcely call profit, but
which does the same job of funding a prudent reserve, new
programs and buildings, etc. (It calls it "contribution to
margin," instead.) The task of defining profit is essentially a
philosophical one. Granted that we should distinguish
gross profit from net profit and that gross "profit" has to
cover all overhead (e.g. administrative, amortization, insur-
ance and space expenses) which may leave no (net) profit at
all, what should we do about the cost of the money capital
and time invested when both are furnished by a proprietor/
manager or by donors? Is a proprietor whose "net profit"
covers his time at the rate of $5 per hour really making a
profit or a loss when s/he could make $20/hour in salary? If
ROI on the capital investment is 3% in a market which pays
10% on certificates of deposit, is this "making a profit"?
Using opportunity cost analysis, the answer is, No; but the
usual analysis says, Yes. That's correct for the Internal
Revenue Service, but not for employees considering a
strike. As usual, cost analysis turns out to be conceptually
very complex although few people realize this; conse-
quently serious mistakes are very common. If the buildings
(or equipment) have been amortized completely, should
one deduct a slice of the eventually-necessary replacement
cost down-payment before one has a profit? Should some
recompense for risk (or prior losses) be allowed before we
get to "profit"? Cost analysis/fiscal evaluation looks pre-
cise because it's quantitative, like statistics, but eventually
the conceptual/practical problems have to be faced and
most current definitions will give you absurd conse-
quences, e.g. "the business is profitable, but I can't afford to
keep it going."
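
The two examples in this entry (the $5 versus $20 hour, and the 3% versus 10% return) can be worked through as a small Python sketch. The function name, the 2,000 hours worked, and the $100,000 of capital are assumptions made up for the illustration; only the rates come from the text.

```python
def economic_profit(net_profit, hours_worked, market_wage,
                    capital, market_rate_percent):
    """Accounting net profit minus the opportunity cost of time and capital."""
    forgone_salary = hours_worked * market_wage               # what the time could have earned
    forgone_interest = capital * market_rate_percent / 100    # what the money could have earned
    return net_profit - forgone_salary - forgone_interest

# Proprietor whose net profit covers 2,000 hours at $5/hour,
# but who could earn $20/hour in salary (capital ignored here).
time_case = economic_profit(net_profit=10_000, hours_worked=2_000,
                            market_wage=20, capital=0, market_rate_percent=10)

# 3% ROI on $100,000 of capital when certificates of deposit pay 10%
# (proprietor's time ignored here).
capital_case = economic_profit(net_profit=3_000, hours_worked=0,
                               market_wage=20, capital=100_000,
                               market_rate_percent=10)

print(time_case, capital_case)
```

Both cases show a positive accounting profit and a negative result once opportunity costs are charged, which is the "No" answer the entry describes. The sketch leaves out the replacement-cost and risk questions raised above, since the text treats those as open.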

PROGRAM The general effort which marshals staff 
and projects towards some (often poorly) defined and
funded goals. Program evaluation is the largest area of
self-conscious evaluation, though product evaluation may
be the largest area of practice. Much of this book refers to
problems in program evaluation; see, for example, Point of 
Entry, Bias Control. 

PROGRAMMED TEXT One in which the material is
broken down into small components ("frames"), ranging in
length from one sentence to several paragraphs, within
which some questions are asked about the material, e.g. by
leaving a blank which the reader has to fill in with the
correct word, possibly from a set of options provided. This
interactive feature was widely proclaimed to have great vir-
tue in itself. It had none, unless very thorough R&D effort
was also employed in the process of formulating the exact
content and sequence of the frames and choices provided.
Since the typographic format does not reveal the extent of
the field-testing and rewriting (and hence conceals the total
absence of it), lousy programmed texts quickly swamped
the market (late 50s) and showed that Gresham's Law is not
dead. As usual, the consumers were mostly too naive to
require performance data and the general conclusion was
that programmed texts were "just another fad." In fact, the
best ones were extremely powerful teaching tools, were in
fact "teacher-proof" (a phrase which did not endear them to
one group of consumers), and some are still doing well
(Sullivan/BRL reading materials, for example). A valiant
effort was made by a committee under Art Lumsdaine to set
up standards, but the failure of all professional training
programs to teach their graduates serious evaluation skills
meant there was no audience for the standards. We shall see
whether the new Evaluation Standards from the Stuffle-
beam group suffer a better fate.

PROJECTS Projects are time-bounded efforts, often 
within a program. 

PROJECTIVE TESTS These are tests with no right ans-
wer; the Rorschach inkblot test is a classic example, where
the subject is asked to say what s/he sees in the inkblot. The
idea behind projective tests was that they would be useful
diagnostic tools, and it seems quite possible that there are
clinicians who do make good diagnoses from projective
tests. However, the literature on the validity of Rorschach
interpretations, i.e. those which can be expressed verbally
as unambiguous rules for interpretations, does not establish
substantial validity. The same is unfortunately true of many
other projective tests, which fail to show even test-retest
reliability, let alone interjudge reliability (assuming that
shared bias is ruled out by the experimental design), let
alone predictive validity. Of course, they're a lot of fun, and
very attractive to valuephobes (both testers and testees)
just because there are no right answers.



PROPOSAL EVALUATION See Two-Tier. 

PROTOCOL See Evaluation Etiquette. 

PSEUDO-EVALUATION See Ritualistic Evaluation,
Rationalization Evaluation. 

PSEUDO-NEGATIVE EFFECT An outcome or datum 
that appears to show that an evaluand is having exactly the
wrong kind of effect, whereas in fact it is not. Four paradigm 
examples are: the Suicide Prevention Bureau whose crea- 
tion is immediately followed by an increase in the rate of 
reported suicides; the school intercultural program which 
results in a sharp rise of interracial violence; the college 
faculty teaching improvement service whose clients score 
worse than non-clients; the drug education (or sex educa-
tion) program which leads to "experimentation," i.e. in-
creased use. (See text of Principles and Practices of Evalua-
tion, Scriven (1982), for treatment of these examples.)

PSEUDO-POSITIVE EFFECT Typically, an outcome 
which is consistent with the goals of the program, but in 
circumstances where either the goals or this way of achieving 
the goals is in fact harmful or else side effects of an over-
whelming and harmful kind have been overlooked. Classic
case: "drug education" programs which aim to and get
enrollees off marijuana and result in getting them on regular
cigarettes or alcohol, thereby trading some reduction in
(mostly artificial) crimes for far more deaths from lung can-
cer, cirrhosis of the liver and traffic accidents. (A typical
example of ignoring opportunity costs and side effects, i.e.
bad GBE.)

PSYCHOMOTOR SKILLS (Bloom) Learnt muscular 
skills. The distinction from cognitive and affective is not
always sharp e.g. typing looks psychomotor but is highly
cognitive as well.

PSYCHOLOGICAL EVALUATION, PSYCHO-THERA- 
PEUTIC EVALUATION Particular examples of practical 
evaluation, the first often primarily taxonomical, the second 
often primarily predictive. The usual standards of validity 
apply, but are rarely checked; the few studies suggest that
even the reliability is very low, and what there is may be
largely due to shared bias. The term "assessment" is often
used here instead of evaluation.



QUALITATIVE (evaluation) A great deal of good 
evaluation (e.g. of personnel and products) is wholly or 
chiefly qualitative. But the term is sometimes used to mean
"non-experimental" or "not using the quantitative meth-
ods of the social sciences," and this has confused the issue,
since there is a major tradition and component in evaluation
which fits the just-quoted descriptions but is quantitative,
namely the auditing tradition and the cost analysis compo-
nent. What has been happening is a gradual convergence of
the accountants and the qualitative social scientists towards 
the use of the others' methods and the use of some qualita- 
tive techniques from humanistic disciplines and low-status 
social sciences (e.g. ethnography). Obviously evaluation 
requires all this and more, and the dichotomy between 
qualitative and quantitative has to be defined clearly and 
seen in perspective or it is more confusing than en- 
lightening. 

QUALITY ASSURANCE, QUALITY CONTROL A
type of evaluative monitoring, originating in the product
manufacturing area, but now used to refer to evaluative 
monitoring in the human services delivery area. This kind 
of evaluation is formative in the sense that it is run by the 
staff responsible for the product, or their supervisors, but it 
is the kind of formative that is essentially "early-warning
summative," because one is endeavoring to ensure that the
product, when it reaches the consumer, will appear to be
highly satisfactory from the consumer's point of view. Thus
quality control is not at all like a common type of evaluative
monitoring, which is checking on whether the project is on
target; that is a form of goal-based evaluation. Quality con-
trol should be consumer-oriented evaluation, i.e. at least
supplemented by goal-free, or needs-based evaluation. 

QUANTITATIVE (evaluation) Usually refers to the 
use of numerical measurement and analysis methodology
from social science or (rarely) accounting. Cf. Qualitative.

QUARTILE (Stat.) See Percentile. 

QUASI-EXPERIMENTAL DESIGN (Term due to 
Donald Campbell) When we cannot actually do a random 
allocation of subjects to the control and experimental groups,
or cannot arrange that all subjects receive the treatment at
the same time, we settle as next best for quasi-experimental
design, where we try to simulate a true experimental design
by carefully picking someone or a group for the "control
group" (i.e., those who did not in fact get the primary
treatment) who very closely matches the experimental per-
son/group. Then we study what happens to and perhaps
test our "experimental" and "control" groups just as if we
had set them up randomly. Of course, the catch is that the
reasons (causes) why the experimental group did in fact get
the treatment may be because they are different in some 
way that explains the difference in the outcomes (if there is 
such a difference), whereas we— not having been able to 
detect that difference— will think the difference in outcome 
is due to the difference in the treatment. For example, 
smokers may, it has been argued, have a higher tendency to 
lung irritability, an irritation which they find is somewhat
reduced, in the short run, by smoking; and it may be this
irritability, not smoking, that yields the higher incidence of 
lung cancer. Only a "true experiment" could exclude this 
possibility, but that would probably run into moral prob- 
lems. However, the weight and web of the quasi-experi-
ments in cancer research has virtually excluded this possi-
bility. See Ex Post Facto.
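
The "carefully picking" step in this entry can be illustrated with a minimal Python sketch. It is not a method the entry prescribes: it assumes matching on a single measured covariate, pairs greedily without replacement, and uses a hypothetical function name and made-up values.

```python
def match_controls(treated, candidates):
    """Pair each treated unit with the closest unused comparison unit.

    Both arguments are lists of (unit_id, covariate) tuples. There must be
    at least as many candidates as treated units.
    """
    available = list(candidates)
    pairs = []
    for unit_id, value in treated:
        # Closest remaining candidate on the one covariate we measured.
        best = min(available, key=lambda c: abs(c[1] - value))
        available.remove(best)  # without replacement
        pairs.append((unit_id, best[0]))
    return pairs

# Hypothetical units and covariate values (e.g. a pretest score).
treated = [("t1", 52), ("t2", 67)]
candidates = [("c1", 30), ("c2", 50), ("c3", 70), ("c4", 66)]

print(match_controls(treated, candidates))
```

Matching of this kind equates the groups only on what was measured. The catch described above remains: whatever unmeasured difference led the experimental group to get the treatment is untouched by the pairing.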

QUEMAC Acronym for an approach to metaevalua-
tion introduced by Bob Gowin, a philosopher of education
at Cornell, which emphasizes the identification of unques- 
tioned assumptions in the design. (Questions, Unques- 
tioned Assumptions, Evaluations, Methods, Answers, 
Concepts.) 

QUESTIONNAIRES The basic instrument for surveys 
and structured interviews. Design of them takes major skill 
and effort. Usually too long, which reduces response rate as 
well as validity (because it encourages stereotyped, omit- 
ted, or superficial responses.) Must be field-tested; usually 
a second field-test still uncovers problems e.g. of ambi- 
guity. Interesting problems arise with respect to evaluation 
questionnaires e.g. what type to use in personnel evalua- 
tion when the average response turns out to be a 6 on a 
7-point scale, providing inadequate upside discrimination.
One can use stronger anchors; or rephrase as a ranking
questionnaire; or impose grading-on-the-curve (Q-sort)
methodology, by putting limits on the number of allowable
7's or 6's from any one respondent; or provide cautionary
instructions or systems. The first and last of these introduce
less distortion where merit levels really are high; the U.S. 
Air Force once ran into a minor rebellion when it adopted 
the third alternative. See also Rating Scales, Symmetry. 

RANDOM A "primitive" or ultimate concept of statis-
tics and probability, i.e., one that cannot be defined in terms
of any other except circularly. Texts often define a random
sample from a population as one picked in a way that gives
every individual in the population an equal probability of
being chosen; but one can't define "equal probability" with-
out reference to randomness or a cognate. A distinctly tricky
notion. It is not surprising that the first three "tables of
random numbers" turned out to have been doctored by
their authors. Although allegedly generated (in completely
different ways) by mechanical and mathematical proce-
dures which met the definition just given, they were doc-
tored into non-randomness, e.g. because pages or columns
which held a substantial preponderance of a particular digit
or a deficit of one particular digit-pair were deleted, where-
as of course such pages must occur in approximations to a
complete listing of all possible combinations. No finite table
can be random by the preceding definition. The best defini-
tion is relativistic and pragmatic; a choice is random with
regard to the variable X if it is not significantly affected by
variables that significantly affect X. Hence a die or cut of
cards or turn of the roulette wheel is random with regard to
the interests of the players if the number that comes up is
caused to do so by variables which are not under the influ-
ence of the players' interests. Tables of random numbers are
random only with respect to certain kinds of bias (which one
should state) and certain ranges of sample size.
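
The doctoring described in this entry can be imitated in a short Python sketch. Everything here is an assumption for illustration: the page size, the cutoff of 17 occurrences, the seed, and the function names are invented, and Python's pseudo-random generator stands in for the mechanical procedures the early tables used.

```python
import random
from collections import Counter

def make_pages(n_pages, digits_per_page, seed):
    """Generate pages of pseudo-random decimal digits."""
    rng = random.Random(seed)
    return [[rng.randrange(10) for _ in range(digits_per_page)]
            for _ in range(n_pages)]

def max_digit_count(page):
    """Number of times the most frequent digit appears on a page."""
    return max(Counter(page).values())

def doctor(pages, limit):
    """Delete every page on which some digit appears more than `limit` times."""
    return [p for p in pages if max_digit_count(p) <= limit]

pages = make_pages(n_pages=1000, digits_per_page=100, seed=1981)
doctored = doctor(pages, limit=17)
print(len(pages) - len(doctored), "pages deleted for looking non-random")
```

With 100 digits per page each digit is expected about 10 times, so pages with 18 or more of one digit are uncommon but should turn up in a table of this size. Deleting them leaves a table that looks tidier page by page yet no longer behaves like a sample from all possible digit sequences, which is the point made above.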

RANKING, RANK-ORDERING Placing individuals
in an order, usually of merit, on the basis of their relative
performance on (typically) a test or measurement or obser-
vation. Full ranking does not allow ties, i.e. two or more
individuals with the same rank ("equal third"); partial rank-
ing does; it may then, in the limit cases, where there are
large numbers of ties and a small number of distinct groups,
not be different from grading.
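
The difference between the two kinds of ranking can be shown with a brief Python sketch. The names and scores are hypothetical, and breaking ties alphabetically in the full ranking is an arbitrary assumption; the entry does not say how ties should be broken.

```python
def partial_ranking(scores):
    """Rank from best to worst; tied scores share a rank ("equal third")."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    for position, (name, score) in enumerate(ordered, start=1):
        previous = ordered[position - 2] if position > 1 else None
        if previous is not None and score == previous[1]:
            ranks[name] = ranks[previous[0]]  # share the rank of the tied neighbor
        else:
            ranks[name] = position
    return ranks

def full_ranking(scores):
    """Rank with no ties allowed; ties are broken by name (an assumption)."""
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return {name: position for position, (name, _) in enumerate(ordered, start=1)}

# Hypothetical test scores.
scores = {"Ann": 91, "Bob": 85, "Cy": 78, "Dee": 78, "Eve": 60}

print(partial_ranking(scores))
print(full_ranking(scores))
```

A full ranking forces a distinction between Cy and Dee that the scores do not support, so some tie-breaking rule has to be supplied from outside the data.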

RATING Usually same as grading. 

RATING SCALES Device for standardizing responses 
to requests for (typically evaluative) judgments. There has 
been some attempt in the research literature to identify the
ideal number of points on a rating scale. An even number
counteracts the tendency of some raters to use the midpoint
for everything by forcing them to jump one way or the
other; on the other hand, it eliminates what is sometimes
the only correct response. Scales with 10 or more points
generally prove confusing and drop the reliability; with 3 or
less (Pass/Not Pass is a two point scale), too much informa-
tion is thrown away. Five- and (especially) seven-point
scales usually work well. (It should be noted that the A-F
scale is semantically asymmetrical when used with the
usual anchor points, i.e. it will not give a normal distribution
(in the technical sense) of grades for a population in which
talent is normally distributed.) With + and - and fence-
sitting supplements (A+, A, A-, AB, B+, B, B-, BC ...), it
runs to 19 points and with the double + (double -), it has 29
points and the refinements become essentially ritualistic.
Note that the translation of letter grades into numbers, e.g.
for purposes of computing a grade point average, involves
assumptions about the equality of the intervals (of merit)
between the grades, and about the location of the zero
point, which are usually not met (LE). See also Question-
naire, Point Constancy Requirement.
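
The counts of 19 and 29 points given in this entry can be checked by enumeration. The Python sketch below assumes five letters (A, B, C, D, F), that every letter including F takes the modifiers, and that one fence-sitting grade lies between each adjacent pair; those assumptions are what reproduce the entry's figures, and the function name is hypothetical.

```python
LETTERS = ["A", "B", "C", "D", "F"]

def scale_points(modifiers):
    """List every grade on the A-F scale with the given modifiers and fence-sitters."""
    points = []
    for i, letter in enumerate(LETTERS):
        points.extend(letter + m for m in modifiers)
        if i < len(LETTERS) - 1:
            points.append(letter + LETTERS[i + 1])  # fence-sitter, e.g. "AB"
    return points

plus_minus = scale_points(["+", "", "-"])
double = scale_points(["++", "+", "", "-", "--"])

print(len(plus_minus), len(double))
print(plus_minus[:8])
```

Converting any of these lists to numbers for a grade point average would still need the equal-interval and zero-point assumptions that the entry warns are usually not met.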

RATIONALITY See Optimization. 

RATIONALIZATION Pseudo-justifications, usually 
provided ex post facto. See Consonance. 

RATIONALIZATION EVALUATION An evaluation 
is sometimes performed in order to provide a rationalization 
for a predetermined decision. This is much easier than it 
might appear, and a good many managers know very well
how to do it. If they want a program shot down, they hire a
gunslinger; if they want one praised or protected, they hire
a sweetheart. Every now and again evaluators are brought
in by clients who have got them into the wrong category and
the early discussions are likely to be embarrassing, annoying
or amusing, depending upon how badly you needed the
job. (Suchman's "whitewash" was an example.)

RAW SCORES The actual score on a test, before it is
converted into percentiles, grade equivalents, etc. It is
normally an evaluative score since the points are usually
given for the merit of an answer, which means that much
data (e.g. raw scores) is evaluative, i.e. the quantitative
science rests on the qualitative discipline.

R&D Research and Development; the basic cyclic (iter-
ative) process of improvement, e.g. of educational materials
or consumer products: research, design and prepare, pilot
run, investigate (evaluate) results, design improvements,
run improved version, etc.

RDD&E Research, Development, Diffusion (or Dis-
semination) and Evaluation. A more elaborate acronym for
the development process. 

REACTIVE EFFECT A phenomenon due to (an artefact 
of) the measurement procedure used: one species of evalua- 
tion or investigation artefact. It has two sub-species, con- 
tent-reaction effects and process-reaction effects. Evalua- 
tion-content reactions include cases where a criticism in a 
preliminary draft of an evaluation is taken to heart by the 
evaluee and leads to instant improvement, thereby "invalidating"
the evaluation. Evaluation-process reactions include
cases where the mere occurrence (or even the prospect)
of the evaluation materially affects the behavior of the
evaluee(s) so that the assessment to be made will not be
typical of the program in its pre-evaluated states. Process
reactivity is thus content-independent. Although reactive
measurements have not previously been thus sub-divided,
the distinction does apply there and not just to evaluation;
but it is less significant. In both cases, unobtrusive approaches
may be appropriate to avoid process-reactivity;
but on the other hand openness may be required on ethical
grounds. The openness may be with respect to content or
with respect to process or both. See Reasons for evaluation. 
Example: Hawthorne Effect. 

REASONS FOR EVALUATION Two common reasons
are to improve something (formative evaluation) and to
make various practical decisions about something (summative
evaluation). Pure interest in determining the merits of
something is another kind of summative evaluation. (Lincoln's
evaluation of the five generals from which he chose
Ulysses S. Grant is summative evaluation of the applied
kind (decision-oriented); a contemporary historian's evaluation
of them is pure evaluation research (conclusion-oriented),
i.e. of the second kind. The two are different only
in role, not intrinsically.) There are also what might be
called content-independent reasons for doing evaluation,
e.g. as a rationalization or excuse (for a hatchet job or for
funding a favorite) or for motivation (to work more carefully
or harder). In the excuse case, the general nature of the
evaluation's content must be known or arranged in advance,
e.g. by hiring a known "killer" or "sweetheart." A ritual
evaluation is done only because it is required. Other reasons,
not wholly independent of the above, are for accountability,
advocacy, political advantage, and postponement
(Suchman), i.e. to gain time.

RECIPIENTS See Consumer, Recoil Effects.

RECOMMENDATIONS In a trivial sense, an evaluation
involves an implicit recommendation: that the evaluand
be viewed/treated in the way appropriate to the value it
was determined to have by the evaluation. But in the specific
sense often assumed to be appropriate, where "recommendation"
is taken to mean "remedial actions," evaluations
may not lead to them even if designed so as to do so (which is
much more costly). That remediation recommendations are
not always possible, even when evaluation is possible, is
obvious in medicine and product evaluation; but because
the logic has not been well thought out, it is widely supposed
to be a sign of bad design or an absence of humanity
when personnel or program evaluations do not lead to
them, or even when they do not lead to guaranteed-to-be-successful
recommendations. But there are some people
who are irremediably incompetent at a given complex task,
e.g. teaching in a "war zone" school, and not even the
progress of science will alter that qualitative fact, though it
may alter the percentage that can be trained. It is a very
grave design decision in evaluation to commit a design to
producing remedial suggestions, just as it is to undertake to
discover explanations; it may increase cost and the chance
of failure by 1000 percent. But most evaluations spin off a
few useful recommendations without much extra effort.
The main obstacle to doing more than this is that remediation
requires not only local knowledge but very special
skills. A road-tester is not a mechanic; an evaluator is not a
management trouble-shooter.

RECOIL EFFECTS When a hunter shoots a deer, he 
(sic) sometimes bruises his shoulder. Programs affect their 
staff as well as the clientele. The effect is of secondary 
importance compared to what happens to the deer or the 
clientele, but must be included in program evaluation. The
staff is impacted, but is not a recipient population. 

REGRESSION TO THE MEAN You may have a run of 
luck in roulette, but it won't last; your success ratio will
regress (drop back) to the mean. When a group of subjects is
selected for remedial work on the basis of low test scores,
some of them will have scored low only through "bad luck,"
i.e., the sampling of their skills yielded by (the items on) this
test is in fact not typical. If they go through the training and
are retested, they will score better simply because any second
test would (almost certainly) result in their displaying
their skills more impressively. This phenomenon gives an 
automatic but phony boost to the achievements of "perfor- 
mance contractors" if they are paid on the basis of improve- 
ment by the low-scorers. If they had to improve the score of 
a random sample of students, regression down to the mean 
would offset the regression up to the mean we have just 
discussed. But they are normally called in to help the stu- 
dents who "need it most" and picking that group by testing 
will result in including a number who do not need help. (It 
will also exclude some who do.) Multiple or longer tests or the
addition of teacher (expert judge) evaluations reduce this
source of error.
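
The phony boost can be seen in a small simulation, sketched here in Python. Everything in it is an assumption made for illustration: each simulated student has a fixed "true skill," each test adds independent random luck, and no training of any kind takes place between the two tests. The lowest scorers on the first test still improve on the retest.

```python
import random
from statistics import mean


def simulate_regression(n_students=10_000, cutoff_fraction=0.10, seed=0):
    """Return (first-test mean, retest mean) for the lowest scorers on test 1.

    Assumed model: observed score = true skill + independent luck.
    No training is applied between the two tests.
    """
    rng = random.Random(seed)
    true_skill = [rng.gauss(50, 10) for _ in range(n_students)]
    test1 = [skill + rng.gauss(0, 10) for skill in true_skill]
    test2 = [skill + rng.gauss(0, 10) for skill in true_skill]

    # Pick the "remedial" group purely on the basis of low test-1 scores.
    ranked = sorted(range(n_students), key=lambda i: test1[i])
    selected = ranked[: int(n_students * cutoff_fraction)]

    return mean(test1[i] for i in selected), mean(test2[i] for i in selected)


first_mean, retest_mean = simulate_regression()
print(f"low scorers, first test: {first_mean:.1f}")
print(f"same students, retest:   {retest_mean:.1f}")
```

Calling `simulate_regression(cutoff_fraction=1.0)` treats the whole group as the sample, and the two means then come out nearly equal, which corresponds to the remark that regression down and regression up offset each other in a random sample.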

RELATIVISM/SUBJECTIVISM Roughly speaking, 
the view that there is no objective reality about which the 
evaluator is to ascertain the truth, but only various perspec- 
tives or approaches or responses, amongst which selection 
is fairly arbitrary or is dependent upon aesthetic and psy- 
chological considerations rather than scientific ones. The 
contrary point of view would naturally be referred to as
absolutism or objectivism; in one technical sense used in
philosophy the opposite of subjectivism is called the doctrine
of realism. The fundamental logical fallacy that confounds
many discussions of this issue is the failure to see the
full implications of the fact that relativism is a self-refuting
doctrine, i.e., "relativism is true" can be no more true than
"relativism is false," and hence relativism can hardly represent
a Great Truth, since it is self-refuting. One very important
implication of this point for evaluation practice is as
follows: in a situation where a number of different ap-
proaches, methodologies or perspectives on a particular 
program (for example) are possible and all are about equally 
plausible, it does not follow that any one of them would 
constitute a defensible evaluation. The only thing that follows
is that giving all of them, with the statement that all of
them are equally defensible, would constitute a defensible
evaluation. The moment that one has seen that several
alternative approaches are equally well-justified, although
they yield incompatible results, one has seen that no one of
these can be thought of as sound in itself, just because the
unqualified assertion of any one of them implies the denial
of the others and that denial is, in such a case, illegitimate.
Hence the assertion of any one of them by itself is illegitimate.
If, on the other hand, the different positions are not
incompatible, then they must still be given in order to present
a comprehensive picture of whatever is being evaluated.
In neither case, then, is giving a single one of these
perspectives defensible. In short, the great difficulties of
establishing one evaluation conclusion by comparison with
others cannot be avoided by arbitrarily picking one, but
only by proving the superiority of one or including all as
perspectives, a term which, correctly used, implies the ex-
istence of a reality which is only partly revealed in each 
view. Thus it converts incompatible reports into com- 
plementary ones i.e. it converts relativism into objectivism. 
Merely giving several apparently incompatible accounts in 
an evaluation is incompetent; showing how they can be 
reconciled i.e. seen as perspectives, is also required. (Or else 
a proof that one is right.) The presupposition that there is a 
single reality is not an arbitrary one, any more than the 
assumption that the future will be somewhat like the past is 
arbitrary; these are the best-established of all truths about
the world. Determinism was equally well-established and
we have now had to qualify it slightly because of the Uncertainty
Principle. We have not yet encountered good reasons
for qualifying the assumptions of realism and induction (the
technical names for the two previously mentioned).

As the practical end of these considerations, it must be
recognized that even evaluations ultimately based on "mere
preferences" may still be completely objective. One must
distinguish sharply between the fact that the ultimate basis
of merit in such cases is mere preference, on which the
subject is the ultimate source of authority, and the fallacy of
supposing that the subject must therefore be the ultimate
source of authority about the merits of whatever is being 
evaluated. Even in the domain of pure taste, the subject 
may simply not have researched the range of options prop- 
erly, or avoided the biassing effects of labels and advertis- 
ing, or recommendations by friends, so the evaluator may 
be able to identify critical competitors that outperform the 
subject's favorite candidate, in terms of the subject's own taste.
And of course identifying Best Buys for an individual involves
a second dimension (cost) which the evaluator is
often able to determine and combine more reliably than the
amateur. The moment we move the least step from areas
where superiority is unidimensional, instantaneous, and
entirely taste-dependent (essentially the pre-evaluative
areas, since it's not superiority but mere preference that is
being judged), we find the subject beginning to make
errors of synthesis in putting together two or three dimensions
of preference (halo or sequencing effects, for example),
or in extrapolating to continued liking; errors that
an evaluator can reduce or eliminate by appropriate experi-
mental design, often leading to a conclusion quite different 
from that which the subject had formed. One step further 
away, and we find the possibility of the subject making 
first-level errors of judgment, e.g. about what they need (or 
even what they want) by contrast with what they like, and 
these can certainly be reduced or eliminated by appropriate
evaluation design. In the general case of the evaluation of
consumer goods, the question of whether one can identify
"the best" product with complete objectivity, despite a substantial
range of different interests and preferences at the
basic level by the relevant consumer group, is simply a
question of whether the interproduct variations in perfor-
mance outweigh the interconsumer variations in prefer- 
ence. Enormous variations in preference may be completely 
blotted out by the tremendous superiority of a single prod- 
uct over another, such that it "scores" so much on several 
dimensions which are accorded significant value by all the 
relevant consumers, that even the outlandish tastes 
(weightings) of some of the consumers with respect to some 
of the other dimensions cannot elevate any of the competi- 
tive products to the same level of total score, even for those 
with the atypical tastes. Thus huge interpersonal differences
in all the relevant preferences do not demonstrate the
relativism of evaluations which depend on them. See Perspectival
Evaluation, Sensory Evaluation.
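
The claim that interproduct variation can swamp interconsumer variation can be checked with a toy weighted-sum calculation, sketched below in Python. The products, dimensions, scores and weights are all hypothetical, and a simple weighted sum is assumed as the way of combining dimensions; the entry itself does not prescribe one.

```python
# Hypothetical performance scores (0-10) for two products on three dimensions.
scores = {
    "X": {"durability": 9, "safety": 9, "styling": 5},
    "Y": {"durability": 4, "safety": 5, "styling": 8},
}

# Hypothetical weightings (tastes) of two very different consumers.
# Both give some significant value to durability and safety.
weights = {
    "typical": {"durability": 5, "safety": 5, "styling": 2},
    "style_lover": {"durability": 3, "safety": 3, "styling": 6},
}


def total_score(product_scores, consumer_weights):
    """Weighted sum of a product's scores under one consumer's weights."""
    return sum(consumer_weights[d] * product_scores[d] for d in consumer_weights)


def best_product(all_scores, consumer_weights):
    """Name of the product with the highest weighted total for this consumer."""
    return max(all_scores, key=lambda name: total_score(all_scores[name], consumer_weights))


for consumer, w in weights.items():
    totals = {name: total_score(s, w) for name, s in scores.items()}
    print(consumer, totals, "best:", best_product(scores, w))
```

Even the consumer who weights styling most heavily ends up with product X on top, because X's lead on the dimensions everyone values is too large for the atypical weighting to overturn.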

RELIABILITY (Stat.) Reliability in the technical sense is 
the consistency with which an instrument or person mea- 
sures whatever it is designed to assess. If a thermometer 
always says 90 degrees Centigrade when placed in boiling 
distilled water at sea-level, it is 100% "reliable," though 
inaccurate. It is useful to distinguish test-retest reliability 
(the example just given) from interjudge reliability (which 
would be exhibited if several thermometers gave the same 
reading). There are many psychological tests which are 
test-retest reliable but not interjudge (i.e., inter-adminis- 
trator) reliable: the reverse is less common. In the everyday 
sense, reliability means the same as the technical term validity;
we'd say that a thermometer which reads 90° C when
it should read 100° C wasn't very reliable. This confusing
situation could easily have been avoided by using the term
"consistency" instead of introducing a technical use of "reliability,"
but that was in the days when jargon was thought
to be a sign of scientific sophistication. As it is, reliability is a
necessary but not a sufficient condition for validity, hence
worth checking first since in its absence validity can't be
there. (There is, unfortunately, a hyper-technical exception
to this.)
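
The thermometer example can be put in numbers with a short Python sketch. The readings are invented, and using the spread of repeated readings as a stand-in for (un)reliability and the average error as a stand-in for (in)validity is a simplifying assumption, not a standard psychometric coefficient.

```python
from statistics import mean, pstdev

TRUE_BOILING_POINT_C = 100.0  # distilled water at sea level


def spread_and_bias(readings, true_value):
    """Return (spread of repeated readings, average error against the true value)."""
    return pstdev(readings), mean(readings) - true_value


# Hypothetical repeated readings from two thermometers in boiling water.
consistent_but_wrong = [90.0, 90.0, 90.0, 90.0, 90.0]
erratic_but_centered = [95.0, 105.0, 100.0, 98.0, 102.0]

for name, readings in [
    ("consistent but wrong", consistent_but_wrong),
    ("erratic but centered", erratic_but_centered),
]:
    spread, bias = spread_and_bias(readings, TRUE_BOILING_POINT_C)
    print(f"{name}: spread = {spread:.2f}, bias = {bias:+.2f}")
```

The first thermometer has zero spread (perfectly "reliable" in the technical sense) and a bias of ten degrees; the second has no bias on average but cannot be trusted on any single reading.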

RELIABILITY (of evaluation) A largely unknown
quantity, easily obtained by running replications of evaluations,
either serially or in parallel. The few data on these
make clear that reliability (apart from spurious effects such
as shared bias) is not high. The use of calibration exercises
and checklists and trained evaluators can improve this
enormously. Paul Diederich showed how to do this with the
evaluation of composition instruction, and that paradigm
generalizes.

REMEDIATION A specific recommendation for improvement,
characteristic of (and certainly desirable in)
formative rather than summative evaluation. But formative
can be useful without any remediation suggestions, and it is 
in general more difficult (sometimes completely impossible) 
and more expensive if it aims for remediation. See also 
Recommendation. 

REPLICATION A very rare phenomenon in the ap- 
plied human studies fields, contrary to reports, mainly be- 
cause people do not take the notion of serious testing for 
implementation (e.g. through the use of an index of im- 
plementation) as an automatic requirement on any sup- 
posed replication. Even the methodology for replication is 
poorly thought out; for example, whether the replicator 
should have any detailed knowledge of the results at the 
primary site. Such knowledge is seriously biasing— on the 
other hand, it significantly simplifies the preparations for 
ranges of measurement, etc. It is probably quite important 
to arrange at least some replications where the (e.g.) pro- 
gram to be replicated is simply described in operational 
terms, perhaps with the incidental remark that it has shown 
"promising results" at the primary site. 

REPORT WRITING/GIVING One of several areas in
evaluation where creativity and originality are really im- 
portant, as well as knowledge about diffusion and dissemi- 
nation. Reports must be tailored to audience as well as 
client needs and may require a minor needs assessment of 
their own. Multiple versions, sometimes using different 
media, as well as different vocabularies, are often appro- 
priate. Reports are products and should be looked at in 
terms of the KEC; field-testing them is by no means inappropriate.
Who has time and resources for all this? It depends
whether you are really interested in implementation
of the evaluation. Would you write it in Greek? No, so why 
assume that you are not writing it in the equivalent of Greek, 
as far as your audiences are concerned? 






RESEARCH The general field of disciplined investiga- 
tion, covering the humanities, the sciences, jurisprudence, 
etc. Evaluation research is one subdivision; there is no way
to distinguish other research from evaluation (apart from
content) except by distorting one or the other. "Evaluation
research" is usually just a self-important name for serious
evaluation; it would be better used to refer to research on
evaluation methodology, or research that pushes out the 
frontiers of evaluation, or at least research that involves 
considerable investigatory difficulty or originality. Cf. per- 
forming arts vs. creative arts. 

RESEARCH INTEGRATION, RESEARCH SYNTHE- 
SIS See Meta-analysis. 

RESEARCH EVALUATION Evaluating the quality 
and/or value and/or amount of research (proposed or per- 
formed) is crucial for e.g. funding decisions and university 
personnel evaluation. It involves the worth/merit distinc- 
tion— "worth" here refers to the social or intellectual pay- 
offs from the research, "merit" to its intrinsic (professional) 
quality. While some judgment is always involved, that is no
excuse for allowing the usual wholly judgmental process;
one can quantify and in other ways objectify the merit and
worth of almost all research performances to the degree requisite
for personnel evaluation. (The solution does not lie in
the use of the Citation Index.)

RESPONSE SET Tendency to respond in a particular 
way, regardless of the merits of the particular case. Some 
respondents tend to rate everything very high on a scale of
merit, others rate everything low, and yet others put everything
in the middle. One can't argue out of context that such
patterns are incorrect; there are plenty of situations in which
those are exactly the correct responses. When we're talking 
about response set, however, we mean the cases where 
these rigid response patterns emerge from general habits 
and not from well thought-out consistency. 

RESPONSIBILITY EVALUATION Evaluation that is 
oriented to identification of the responsible person(s) or the 
degree of responsibility, and hence usually the degree of 
culpability or merit. Responsibility has causality as a necessary
but not a sufficient condition. Culpability similarly
presupposes responsibility but involves further conditions
from ethics. Social scientists, like most people not trained in
the law or casuistry, are typically totally confused about such
issues, e.g. supposing that certain evaluations shouldn't be
done (or published) because "they may be abused." The
abuse is culpable; but so is failure to publish (professional-quality
work of some prima facie intellectual or social value)
(e.g. the Jensen case). A different kind of example involves
keeping really bad teachers on in a school district because the 
alternative of attempting dismissal involves effort, is unpopular
with the union, and usually unsuccessful. The responsibility
is to the pupils who are sacrificed at the rate of
30 per annum per bad teacher; and that responsibility is so
serious that you (the superintendent or the board) have to
try for removal because (a) you may succeed, (b) the effects
may be on balance good, e.g. there may be a gain in overall
motivation even if you lose the case, (c) you may learn how
to do it better next time. The evaluation of schools should
(normally) only be done in terms of the variables over which
the school has control. In the short run and often in the long
run, this does not include scores on standardized tests. (See
SEP.) The evaluation of evaluations should never be done in
terms of results, because the evaluator is not responsible for
implementation; but it should be done in terms of results if
implemented. Ref. Primary Philosophy, Scriven, McGraw-Hill,
1966.

RESPONSIVE EVALUATION Bob Stake's current ap- 
proach, which contrasts with what he calls "preordinate" 
evaluation, where there is a predetermined evaluation de- 
sign. In responsive evaluation, one picks up whatever turns 
up and deals with it as seems appropriate, in the light of the 
known and unfolding interests of the various audiences.
The emphasis is on rich description, not testing. The risk is
of course a lack of structure or of valid proof, but the trade-off
is the avoidance of the risk of a preordinate evaluation:
a rigid and narrow outcome of little interest to the audiences.
See Evaluation, Evaluation-Specific Methodology,
Naturalistic Evaluation, Perspectival Evaluation, Rela- 
tivism. 

RETURN ON INVESTMENT (ROI) One of the measures
of merit or worth in fiscal evaluation, usually quoted
as a per annum percentage rate.

RISK(S), EVALUATING The classic expectancy approach,
in which the products of the probability of each
outcome by its utility are compared, thus converting the
two dimensions of risk and utility into the one (of expectancy),
is conventionally said to have certain weaknesses.
For example, it ignores the variable value of risk itself to
different individuals; the gambler likes it, many others seek
to minimize it. However, one can put in a risk-utility function.
Sometimes expectancy analysis is criticized on the
grounds that "people don't think that way." The discussion
of minimax or satisficing strategies is often introduced as a
step towards "a more sophisticated analysis of decision-making."
This kind of remark often confuses descriptive
science with normative science. Minimax and satisficing are
simply less sophisticated methods of making decisions,
though they may be more common, and hence appropriate
objects for study by descriptive scientists (and the attendant
jargon). "Risk management" is a topic that has begun to
appear with increasing frequency in planning and manage- 
ment training curricula. One reason that evaluations are not 
implemented is because the evaluator has failed to see that 
risks have different significance for implementers by con- 
trast with consumers; a program or policy (etc.) which 
should be implemented, in terms of its probable benefit to
the consumers, may be one which carries a high risk for the
implementers, because their reward schedule is often radically
different from that of the consumer (usually as a result
of bad planning and management at a higher level). Two
classic examples are the classification of documents as Top
Secret, and the hiring of personnel about whom there is a
breath of suspicion. In each situation, the implementer gets
zapped by review panels exercising 100 percent hindsight
after a disaster if there is the least trace of a negative indicator,
and in neither case is there ever a reward for taking a
reasonable risk; in fact, there's never a review panel to
look at the big winners. Consequently, the public's utilities
are not optimized and are often reversed. The present
political-plus-media environment in the U.S. may be one in
which the risk configuration for the road to the Presidency
(or the legislature) is so different from that required to do
the job right as to guarantee the election of poor incumbents
who were great candidates.
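
The expectancy calculation described at the start of this entry, and the effect of putting in a risk-utility function, can be sketched in Python as follows. The two options and their payoffs are invented, and the square root is used only as one familiar, assumed shape for a risk-averse utility function; the entry does not specify any particular function.

```python
import math


def expectancy(outcomes, utility=lambda payoff: payoff):
    """Sum of probability * utility over (probability, payoff) pairs."""
    return sum(p * utility(payoff) for p, payoff in outcomes)


# Two hypothetical options, each a list of (probability, payoff) pairs.
sure_thing = [(1.0, 40.0)]
gamble = [(0.5, 100.0), (0.5, 0.0)]

# Plain expectancy: payoff itself is treated as the utility.
print("plain:      ", expectancy(sure_thing), expectancy(gamble))

# With an assumed risk-averse utility function (square root of payoff).
print("risk-averse:", expectancy(sure_thing, math.sqrt), expectancy(gamble, math.sqrt))
```

On plain expectancy the gamble wins (50 against 40); with the risk-averse utility the sure thing wins. The same machinery handles both, which is the sense in which the objection about ignoring attitudes to risk can be met from within the expectancy approach.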



RIGHT-TO-KNOW The legal domain of impacted 
populations' access to information; much increased in Carter
period, e.g. through "open file" legislation. Decreased in
Reagan period. 

RHETORIC, THE NEW The title of a book by C. Perelman
and L. Olbrechts-Tyteca (Notre Dame, 1969), which
attempted to develop a new logic of persuasion, reviving
the spirit of pre-Ramist efforts. (Since Ramus (1572), the
view of rhetoric as the art of empty and illogical persuasion
has been dominant; the concept of "logical analysis" as
separate from rhetoric is Ramist.) This area is of the greatest
importance to evaluation methodology, as Ernest House has
stressed (e.g. in Evaluating with Validity, Sage, 1980), because
of the extent to which evaluations have (whether
intentionally or not) the function of persuasion and not
just reporting. The New Rhetoric emerged from the context
of studying legal reasoning, where the same situation obtains
and was poorly recognized. The same push for reappraisal
and new models has occurred in logic (see Informal
Logic, eds. Blair and Johnson, Edgepress, 1980), and in the
social sciences with the move towards naturalistic methodology.
It is all part of the backlash against neo-positivist
philosophy of science and the worship of the Newtonian/
mathematical model of science. Evaluation's fate clearly lies
with the new movements.

RIPPLE EFFECTS See Trickle Effects.

RITUAL(ISTIC) or SYMBOLIC EVALUATION One
of the reasons for doing evaluation that has nothing to do
with the content of the evaluation (and hence is unlike
formative and summative, or rationalization, evaluation)
is the ritual function, i.e. the doing of an evaluation
because it is required, although nobody has the faintest
intention of either doing it well or taking any account of
what it says. Evaluators are quite often called in to situations
like this, although they may not even be recognized as cases
of ritual evaluation by the client. (Evaluation in the bilingual
education area is currently mostly ritualistic.) It is an important
part of the preliminary discussions in serious evaluation
to get clear exactly what kind of implementation is
planned, under various hypotheses about what the content
of the evaluation report might be; unless, of course, you
have time to spare, need the money, and are not misleading
any remote audiences. The third condition essentially never
applies. See also Motivational Evaluation, Reactive Effect.

ROBUSTNESS (Stat.) Statistical tests and techniques 
depend to varying degrees on assumptions especially about 
the population of origin. The less they depend on such 
assumptions, the more robust they are. For example, the
t-test assumes normality, whereas non-parametric ("distribution-free")
statistics are often considerably more
robust. One might translate "robust" as "stable under variation
of conditions." The concept is also applicable to and
most important in the evaluation of experimental designs
and meta-evaluation. Designs should be set up to give
definite answers to at least some of the most important questions
no matter how the data turns out, a matter quite
different from their cost-effectiveness, power, or elegance
(the latter is a kind of limit case of efficiency or power).
Evaluations should be set up so as to "go for the jugular," i.e.
get an adequately reliable answer to the key evaluative
question(s) first, adding the trimmings later if nothing goes
wrong with Part One. This affects budget, staff and time-line
planning. And it has a cost, as does robustness in statistics;
for example, robust approaches will not be maximally ele- 
gant or cost-effective if everything goes right. But meta- 
evaluation will normally show that a minimax approach is 
called for, which means robust evaluation. 

ROLE (of evaluator) The evaluator plays more roles
than Olivier, or should. Major ones include therapist/
confessor, educator, arbitrator, co-author, "the enemy,"
trouble-shooter, jury, judge, attorney.

RORSCHACH EFFECT An extremely complex evaluation,
if not carefully and rationally synthesized into an
executive summary report, provides a confusing mass of
positive and negative comments, and the unskilled and/or
strongly biased client can easily project onto ("see in,"
rationalize from) such a backdrop whatever perception s/he
originally had.

RUBRIC Scoring or grading or (conceivably) ranking
key for a test. Term originated in the field of evaluation of
student compositions and commonly refers to a key for
grading them or other essay answers.

SALIENCE SCORING The practice of requesting re- 
spondents (e.g. when rating proposals) to use only those 
scales which, they felt, most significantly influenced them. 
It focuses attention on the most important features of whatever
is being rated, and it greatly reduces processing time.

SATISFICING Herbert Simon's term for a common 
management policy of picking something acceptable rather 
than the "best choice" (optimizing). See Risks. 

SCALES See Measurement.

SCOPE OF WORK This is the part of an RFP or a proposal
which describes exactly what is to be done, at the level
of description which refers to the activities as they might be
seen by a visitor without special methodological skills or
insight, rather than to their goals, achievements, process or
purpose. In point of fact, scope of work statements tend to
drift off into descriptions that are somewhat less than observationally
testable. The scope of work statement is an important
part of making accountability possible on a contract,
and is therefore an important part of the specifications in an
RFP or a proposal.

SCORING Assigning numbers to an evaluand (usually
a performance), usually from an interval scale, i.e. one in
which the points all have equal value. Sometimes numbers
are used as grades without commitment to point constancy,
but this is misleading; letters should be used instead, and
the attempt to convert them to numbers, e.g. to calculate
GPAs, should be protested unless point constancy holds at
least to an approximation that will not yield errors (LE).
Usually tests should be impressionistically graded as well as
scored, both to get the cutting scores and to provide insurance
against deviations from point constancy. Scoring not
only requires point constancy but also serious consideration
of the definition of a zero score: no answer? hopelessly bad
answer? both? ("Both" is a hopelessly bad answer.) See
Raw Scores, Grading, Ranking, Anchoring.

SECONDARY ANALYSIS Reassessment of an experiment
or investigation, either by reanalysis of the data or
reconsideration of the interpretation. Gathering new data
would normally constitute replication; but there are intermediate
cases. Sometimes used to refer to reviews of large
numbers of studies. See Meta-analysis, Secondary Evaluation.

SECONDARY EVALUATION (Cook) Reanalysis of
original (or original plus new) data in order to produce a
new evaluation of a particular project (etc.). Russell Sage
Foundation commissioned a series of books in which famous
evaluations were treated in this way, beginning with
Tom Cook's secondary evaluation of Sesame Street. Extremely
important because (a) it gives potential clients
some basis for estimating the reliability of evaluators (in the
case just cited, the estimate would be fairly low), (b) it gives
evaluators the chance to identify and learn from their mistakes.
Evaluations have all too often been fugitive documents
and hence have not received the benefit of later
discussion in "the literature" as would a research report
published in a journal; a weakness in the field. (Similar
problem applies to classified material.) Cf. Metaevaluation.

SECRET CONTRACT BIAS In proposal, personnel,
and particularly institutional and training-program evaluation,
raters are often too lenient because they know that the
roles will be reversed on another occasion and they think or
intuit that if everyone sees that, and acts accordingly, "we'll
all come out smelling like roses." Typical unprofessional
conduct, typical of the professions. A good counterbalance
would be to rate everyone on the long-term validity of their
ratings. A more feasible control is the use of external and/or
general-purpose evaluators. See Accreditation.

SELECTIVITY BIAS Arises in program evaluation
when selection of control or experimental group members is
influenced by an unnoticed connection with desirable outcomes.
Irrelevant to studies with random assignment. If
differential attrition occurs, as between experimental and
control group, the possibility of selectivity bias reemerges
even in the randomized design. (See "Issues in the Analysis
of Selectivity Bias," Barnow et al., Evaluation Studies Review
Annual, v. 5, 1980, Sage.)






SEMI-INTERQUARTILE RANGE (Stat.) Half the interval
between the score that marks the top of the
lowest or first quartile (i.e. of the lowest quarter of the group
being studied, after they have been ranked according to the
variable of interest, e.g. test scores), and the score that
marks the top of the third quartile. This is a useful measure
of the range of a variable in a population, especially when it
is not a normal distribution (where the "standard deviation"
would usually be used). It amounts to averaging the
intervals between the median and the individuals who are
halfway out to the ends of the distribution, one in each
direction. Thus it is not affected by oddities occurring at the
extreme ends of the distribution, its main advantage over
the standard deviation, an advantage which it retains even
in the case of a normal population.
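
A minimal sketch of the computation, for illustration only. The score lists are invented, and the choice of Python's "inclusive" quartile method is an assumption (several quartile conventions exist and give slightly different cut points).

```python
import statistics

def semi_interquartile_range(scores):
    """Half the distance between the first and third quartile cut points."""
    # method="inclusive" treats the data as the whole group being studied
    q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return (q3 - q1) / 2

# Hypothetical scores: the second list differs only in one extreme value
typical = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]

siqr_typical = semi_interquartile_range(typical)
siqr_outlier = semi_interquartile_range(with_outlier)
sd_typical = statistics.pstdev(typical)
sd_outlier = statistics.pstdev(with_outlier)

print("semi-interquartile range:", siqr_typical, siqr_outlier)
print("standard deviation:      ", sd_typical, sd_outlier)
```

The single oddity at the top end leaves the semi-interquartile range unchanged while inflating the standard deviation many times over, which is the advantage the entry describes.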

SENSORY EVALUATION Wine-tasting when done
scientifically, the better restaurant reviews, and the Consumers
Union report on bottled water remind us of the important
difference between dismissing something as a "mere matter
of taste" and doing sensory evaluation, which does not
eliminate dependence on preference but improves the reliability
of the judgments of preferences, and improves the
evaluative inference, e.g. by eliminating distractors (such as
labels) and using multiple independent raters and standardized
sets of criteria.

SERVICE EVALUATION Ref. Service Evaluation,
Vol. 1, no. 1, Fall 1981, Center for the Study of Services,
Suite 406, 1518 K St., N.W., Washington, D.C. 20005.

SEP (School Evaluation Profile) An instrument for
evaluating the performance of schools (and hence districts,
principals etc.), which looks only at those variables the
school controls. See Responsibility Evaluation.

SEQUENCING EFFECT The influence of the order of
items (tests etc.) upon responses. A test's validity may be
compromised when items are removed, e.g. for racial bias,
since the item might have preconditioned the respondent
(in a way that has nothing to do with its bias) so as to give a
different and more accurate response to the next (or any
later) question, an example of sequencing effect.

SES Socio-Economic Status.




SHARED BIAS The principal problem with using
multiple expert opinion for validation of evaluations is that
the agreement (if any) may be due to common error; obvious
and serious examples occur in peer review of research
proposals, where the panelists tend to reflect current fads in
the field to the detriment of innovators, and in accreditation.
The best antidote is often the use of intellectually and
not just institutionally external judges, e.g. radical critics of
the field. The inference from reliability to validity must
bridge the chasm of shared bias. The meaning is self-evident,
the applications are not. Shared bias is the main
reason why interjudge or intertest consistency, i.e. reliability
in the technical sense, is no substitute for validity. A
typical example occurs in accreditation, when the driver
education department (e.g. of a high school) is checked by
the driver ed person on the visiting team. There is little
solace to be found in the discovery that: (a) they both like the
department, and the visitor does not recommend its abolition,
although there are very serious reasons for dropping
such departments when money is tight, e.g. they do not
reduce accident rates; (b) a second visiting panel agrees with
the first one's judgment on driver ed (because its judgment
was formed by one more member of a self-serving group).
See Bias Control.

SIDE EFFECTS Side effects are the unintended good
and bad effects of the program or product being evaluated.
Sometimes the term refers to effects that were intended but
are not part of the goals of the program, e.g. employment of
staff. In either case, they may or may not have been expected,
predicted or anticipated (a minor point). In the Key
Evaluation Checklist a distinction is made between side
effects and standard effects on impacted non-target populations,
i.e. side-populations, but both are often called side
effects.


SIGMA (Stat.) Greek symbol (σ). A name for Standard
Deviation.

SIGNIFICANT, SIGNIFICANCE The overall, synthesized
conclusion of an evaluation; may relate to social or
professional or intellectual significance. Statistical significance,
when relevant at all, is usually one of several necessary
conditions for real significance. The significance of an
intervention may be considerable even if it had no effects in
the intended directions, which might be cognitive or health
gains; it may have employed many people, raised general
awareness of problems, produced other gains. The absence
of overall significant effects may also be due to dilution of
good effects in a pool of poor programs producing no effects:
one cannot infer from an overall-null to individual-nulls.
For this reason, "lumping designs" are much less
desirable than "splitting designs" in which separate studies
are made of many sites or sub-treatments (see Replication,
Meta-analysis). Omega-statistics and Glass' "standardized
effect size" are attempts to produce measures that more
nearly reflect true significance than does the p level or the
absolute size of the results. See Statistical Significance,
Educational Significance.
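
The dilution point can be sketched numerically. The site scores below are invented, and the effect-size formula used (mean difference divided by the control group's standard deviation) is the usual reading of a standardized effect size, assumed here rather than taken from this entry.

```python
import statistics

def standardized_effect_size(treated, control):
    """Mean difference expressed in units of the control group's standard deviation."""
    return (statistics.mean(treated) - statistics.mean(control)) / statistics.stdev(control)

# Hypothetical data: one site with a real gain, two sites with none
control = [48, 50, 52, 49, 51, 50]
sites = {
    "site_a": [58, 60, 62, 59, 61, 60],  # a real gain
    "site_b": [48, 50, 52, 49, 51, 50],  # no effect
    "site_c": [48, 50, 52, 49, 51, 50],  # no effect
}

# "Lumping design": pool every treated student into one group
lumped = [score for scores in sites.values() for score in scores]
lumped_effect = standardized_effect_size(lumped, control)

# "Splitting design": one effect size per site
split_effects = {name: standardized_effect_size(scores, control)
                 for name, scores in sites.items()}

print("lumped:", round(lumped_effect, 2))
for name, effect in split_effects.items():
    print(name, round(effect, 2))
```

The lumped figure is a third of the effect at the one successful site, so a weak or null overall result says little about any individual site.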

SIMULATIONS Re-creations of typical situations
to provide a realistic test of aptitudes or abilities. See Clinical
Performance Testing, Personnel Evaluation.

SMILES TEST (of a program) People like it. Typical
example of substituting wants for needs.

SOCIAL INDICATOR See Indicator. 

SOCIAL SCIENCE MODEL (of evaluation) The
(naive) view that evaluation is an application of standard
social science methodology. One look at the usual social
scientist's effort at needs assessment, or at the absence of
one, when doing an evaluation, is enough to make clear
why this is naive. The relation of the social science model to
the multidisciplinary multi-trait multi-field (3M) model implicit
in many entries in this work is like that between
classical statistics and decision theory: the latter is a substantial
generalization of the former; involves massive new
research areas; applies better to practical cases; bridges to a
larger number of other disciplines; but does not invalidate
the former. See Evaluation.

SOFT (approach to evaluation) Uses implementation 
data or the smiles test. See Hard. 

SOLE SOURCE "Sole-sourcing" a contract is an alternative
to "putting it out to bid" via publishing an RFP. Sole
sourcing is open to the abuse that the contract officer from




the agency may let contracts to his or her buddies without 
regard to whether the price is excessive or the quality un- 
satisfactory; on the other hand, it is very much faster, it 
costs less if you take account of the time for preparing RFPs 
and proposals in cases where a very large number of these 
would be written for a very complex RFP, and it is some- 
times mandatory when it is provable that the skills and/or 
resources required are available from only one contractor 
within the necessary time-frame. Simple controls can pre- 
vent the kind of abuse mentioned. 

SPECIALIST EVALUATOR See Local Expert. 

SPEEDED (tests) Also called power tests, or timed 
tests; those tests with a time limit (the time taken by each 
individual is usually not recorded). These are often better 
instruments for evaluation or prediction than the same test 
would be with no time limit — usually because the criterion 
behavior involves doing something under time pressure,
but sometimes, as in IQ tests, just as a matter of empirical 
fact. A test is sometimes defined as speeded if only 75 
percent finish in time. 

SPILLOVER (effects) See Trickle Effects. 

SPONSOR (of evaluation) Whoever or whatever funds 
or arranges funding or facilitates release of personnel and 
space; referred to as "instigator" in the KEC. Cf. Client.

STAKEHOLDER An interested party in an evaluation 
e.g. a politician who supported the original program.

STANDARD(s) The performance level associated with
a particular rating or "grade" on a given criterion or dimension
of achievement; e.g. 80 percent success may be the
standard for passing the written portion (dimension) of the
driver's license test. A cutting score defines a standard; but
standards can be given in non-quantitative grading contexts,
e.g. by providing "exemplars," as in holistic grading of
composition samples.

STANDARD DEVIATION (Stat.) A technical measure
of dispersion; in a normal distribution, about two
thirds of the population lies within one standard deviation
of the mean, median, or mode (which are the same in this
case). The S.D. is simply the square root of the mean of the
squares of the




deviations i.e. the distances from the mean. 
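
As a worked illustration of the computation: the standard deviation is the square root of the mean squared deviation from the mean. The sketch below uses invented scores and the population form (dividing by N); a sample estimate would divide by N - 1 instead. It also checks the "about two thirds" figure for a normal distribution.

```python
import math
import statistics

def standard_deviation(values):
    """Square root of the mean of the squared deviations from the mean."""
    mean = sum(values) / len(values)
    mean_squared_deviation = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(mean_squared_deviation)

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores with mean 5
sd = standard_deviation(scores)
print("standard deviation:", sd)

# Share of a normal population lying within one S.D. of the mean
standard_normal = statistics.NormalDist()
share_within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
print("share within one S.D.:", round(share_within_one_sd, 4))
```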

STANDARD ERROR OF MEASUREMENT (Stat.)
There are several alternative definitions of this term, all of
which attempt to give a precise meaning to the notion of the
intrinsic inaccuracy of an instrument, typically a test.

STANDARD SCORE Originally, scores defined as deviations
from the mean, divided by the standard deviation.
(Effect Size is an example.) More casually, various linear
transformations of the above (z-scores) aimed at avoiding
negative scores.
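
A short sketch of both senses, using invented raw scores. The rescaling to a mean of 50 and a standard deviation of 10 is one common convention and is an assumption here; the entry does not name a particular transformation.

```python
import statistics

def z_scores(raw_scores):
    """Deviation from the mean, divided by the (population) standard deviation."""
    mean = statistics.mean(raw_scores)
    sd = statistics.pstdev(raw_scores)
    return [(score - mean) / sd for score in raw_scores]

def rescale(z_values, new_mean=50, new_sd=10):
    """Linear transformation of z-scores; the defaults are an assumed convention."""
    return [new_mean + new_sd * z for z in z_values]

raw = [40, 50, 60]  # hypothetical raw scores
z = z_scores(raw)
t = rescale(z)

print("z-scores:", [round(value, 2) for value in z])
print("rescaled:", [round(value, 2) for value in t])
```

Anyone below the mean gets a negative z-score; the linear rescaling keeps the same ordering and spacing while moving every score above zero.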

STANDARDIZED TEST Standardized tests are ones
with standardized instructions for administration, use,
scoring and interpretation, standard printed forms and content,
often with standardized statistical properties, that
have been validated on a large sample of a defined population.
They are usually norm-referenced, at the moment, but
the terms are not synonymous since a criterion-referenced
test can also be standardized. Having the norms (etc.) on a
test does mean it's standardized in one respect, but it does
not mean it's just a norm-referenced test in the technical
sense; it may (also) be criterion-referenced, which implies a
different technical approach to its construction and not just
a different purpose.

STANINES (or stanine scores) If you are perverse 
enough to divide a distribution into nine equal parts instead 
of ten (see decile), they are called stanines and the cutting 
scores that demarcate them are called stanine scores. They
are numbered from the bottom up. See also Percentiles. 

STATISTICAL SIGNIFICANCE (Stat.) When the difference
between two results is determined to be "statistically
significant," the evaluator can conclude that the difference
is probably not due to chance. The "level of significance"
determines the degree of certainty or confidence
with which we can rule out chance (i.e. rule out the "null
hypothesis"). Unfortunately, if very large samples are used
even tiny differences become statistically significant though
they may have no social, educational or other value at all.
Omega statistics provides a partial correction for this. Cf.
Interocular Difference. The literature on the "significance
test controversy" shows why quantitative approaches pre- 




suppose qualitative ones, and is indeed an example of the 
evaluation of quantitative measures. See also Hypothesis 
Testing, Raw Scores. 
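
The effect of sample size can be sketched with a simple two-group comparison. The figures are invented (a half-point difference on a scale with a standard deviation of 15), and a large-sample normal approximation is assumed in place of any particular test named in the text.

```python
import math
from statistics import NormalDist

def two_sided_p_value(mean_difference, sd, n_per_group):
    """Normal-approximation p value for a difference between two group means."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = abs(mean_difference) / standard_error
    return 2 * (1 - NormalDist().cdf(z))

# The same tiny difference, tested with modest and with very large samples
p_small_sample = two_sided_p_value(0.5, 15, 100)
p_huge_sample = two_sided_p_value(0.5, 15, 100_000)

print("n = 100 per group:    ", p_small_sample)
print("n = 100,000 per group:", p_huge_sample)
```

The difference is equally trivial in both cases; only the sample size changes, yet the second result would be reported as highly significant.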

STEM The text of a multiple-choice test item that precedes
the listing of the possible responses.

STRATEGY Also called "decision function" or "response
rule." A set of guidelines for choices which may be
predetermined or conditional upon the outcomes of early
choices, or even purely exploratory, i.e. preliminary to main
choices. See Optimization.

STRATIFICATION A sample is said to be stratified if it 
has been deliberately chosen so as to include an appropriate 
number of entities from each of several population subgroups.
For example, one usually stratifies the sample of
students in K-^ educational evaluations with regard to 
gender, aiming at 50 percent males and 50 percent females. 
If one selects a random sample of females to make up half of
the experimental and half of the control group and a ran- 
dom sample of males for the other half, then one has a 
"stratified random sample." If you stratify on too many 
variables you may not be able to make a random choice of 
subjects in a particular stratum — there may be no or only 
one eligible candidate. If one stratifies on very few or no 
variables, one has to use larger random samples to compensate.
Stratification is only justified with regard to variables
that probably interact with the treatment variable, and
it only increases efficiency, not validity, unless you do it in
addition to using large numbers, i.e. abandon the efficiency
gains it makes possible; indeed it runs some risk of reducing
validity because you may not cover a key variable (through 
ignorance) and your reduced sample size may not take care 
of it. 
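
A minimal sketch of a stratified random sample, under stated assumptions: the student records, the function name and the fixed seed are all hypothetical, and gender is the only stratifying variable.

```python
import random
from collections import defaultdict

def stratified_assignment(population, stratum_of, per_stratum, seed=0):
    """Draw the same number at random from each stratum, split between two groups."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    strata = defaultdict(list)
    for person in population:
        strata[stratum_of(person)].append(person)

    experimental, control = [], []
    for name, members in strata.items():
        if len(members) < per_stratum:
            # Too many stratifying variables can leave a stratum nearly empty
            raise ValueError(f"stratum {name!r} has only {len(members)} eligible candidates")
        chosen = rng.sample(members, per_stratum)  # random order, no repeats
        half = per_stratum // 2
        experimental.extend(chosen[:half])
        control.extend(chosen[half:])
    return experimental, control

# Hypothetical population of 200 students, half of each gender
students = [{"id": i, "gender": "F" if i % 2 else "M"} for i in range(200)]
experimental, control = stratified_assignment(
    students, lambda student: student["gender"], per_stratum=20)

print(len(experimental), len(control))
```

Each group ends up half female and half male by construction, while the choice of individuals within each stratum is left to chance.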

STRENGTHS ASSESSMENT Looking at resources 
available, including time, talent, funds, space, it defines the 
"range of the possible" and hence is important in both 
needs assessment and the identification of critical competi- 
tors, as well as in making remediation suggestions, and 
responsibility evaluation.

STYLE RESEARCH Investigations of two kinds: either
descriptive investigations of the actual stylistic characteristics
of people in e.g. certain professions such as teaching or
managing; or investigations of the correlations between certain
style characteristics and successful outcomes. The second
kind of investigation is of great importance to evaluation,
since discoveries of substantial correlations would allow
certain types of evaluation to be performed on a process
basis, which currently can only be done legitimately by
looking at outcomes. (However, personnel evaluation
could not be done in that way, even if the correlations were
discovered.) The former kind of investigation (a typical
example is studying the frequency with which teachers
utter questions by comparison with declarative sentences or
commands) is pure research, and extremely hard to justify
as of either intellectual or social interest unless the second
kind of connection can be made. In general, style research
has come up with disappointingly few winners. (Actual
Learning Time is probably the most important and possibly
the only exception.) No doubt the interactions between the
personality, the style, the age and type of recipient and the
subject matter prevent any simple results; but the poor
results of research on interactions suggest that the interactions
are so strong as to obliterate even very limited recommendations.
We must instead fall back to treating positive
results as possible remedies, not probable indicators of merit.

SUCCESS Goal-achievement of defensible goals. An 
evaluative term, but not one of the most important. Cf.
Merit, Worth. 

SUMMATIVE EVALUATION Summative evaluation
of a program (etc.) is conducted after completion and for the
benefit of some external audience or decision-maker (e.g.
funding agency, historian, or future possible users), though
it may be done by either internal or external evaluators or a
mixture. For reasons of credibility, it is much more likely to
involve external evaluators than is a formative evaluation.
Should not be confused with outcome evaluation, which is
simply an evaluation focused on outcomes rather than on
process, and which could be either formative or summative.
(This confusion occurs in the introduction to the ERS Evaluation
Standards, 1980 Field Edition.) Should not be confused
with holistic evaluation; it may be holistic or
analytic.




SUNSET LEGISLATION Legal commitment to auto- 
matic close-out of a program after a fixed period unless it is 
specifically refunded. An intelligent recognition of the importance
of shifting the burden of proof, and thus analogous
to zero-based budgeting.

SUPERCOGNITIVE The domain of performance on
cognitive (or information/communication) skills that is a
quantum jump beyond normal levels, e.g. speed reading,
lightning calculating, memory mastery, speedspeak or fastalk,
tri-linguality, stenotyping, shorthand. Cf. Hypercognitive.

SURVEY METHODS (in evaluation) See Evaluation 
Specific Methodology. 

SYMBOLIC EVALUATION Another term for Ritual- 
istic Evaluation. 

SYMMETRY of evaluative indicators. It is a common 
error to suppose (or unwittingly to arrange) that the con- 
verse or absence of an indicator of merit is an indicator of 
demerit. This is illustrated by the assumption that items in 
evaluative questionnaires can be rewritten positively or 
negatively to suit the configural requirement of foiling
stereotyped responses. But "frequently lies" is a strong 
indicator of demerit, while "Does not frequently lie" is not 
even a weak indicator of (salient) merit. (Salient merit, i.e.
commendable behavior, is what one rewards, not "being 
better than the worst one could possibly be.") The preced- 
ing is an epistemological point about symmetry (related to 
the virtue/supererogation distinction in ethics). There are
also methodological asymmetries; for example, an item re- 
questing a report on absences e.g. "Was sometimes absent 
without leave" can be answered affirmatively by respon- 
dents who were often not there themselves but who ob- 
served one or more such absences by the evaluee; but "Was 
rarely absent without leave" will be checked "Don't know" 
by the same respondents since it calls for knowledge they 
do not possess.

SYNTHESIS (of studies) The integration of multiple 
research studies into an overall picture is a field which has
recently received considerable attention. These "reviews of 






the literature" are not only evaluations in themselves,
with (it turns out) some quite complex methodology and
viable alternatives involved on the way to a bottom line; but
they are also a key element in the evaluator's repertoire
since they provide the basis for identifying e.g. critical competitors
and possible side-effects. See Meta-analysis.

SYNTHESIS (in evaluation) 1. The process of combining
a set of ratings on several dimensions into an overall
evaluation. Usually necessary and defensible, sometimes
inappropriate because it requires a decision on relative
weighting which sometimes is impossible. Those occasions
require giving just the ratings on the separate dimensions. It
is desirable to require an explicit statement and justification
of the synthesis procedure since this will often expose: (a)
arbitrary assumptions; (b) inconsistent applications. In the
evaluation of faculty, for example, the de facto weighting of
research vs. teaching is often nearer to 5:1 in institutions
whose rhetoric claims parity; but it may vary widely between
departments or between successive chairs in the
same department. The evaluation of student course work
by the letter grade is often cited as an example of indefensible
synthesis; in fact it is a perfectly defensible summative
evaluation, though it is unjustifiable for formative feedback
to the student. "Synthesis by salience summary" illustrates
another trap: a teacher is rated on 35 scales by students and
the printout only shows cases of statistically significant
departures of the ratings from the norms. This seems plausible
enough, but since the dimensions have not been independently
validated (and are not independent) it not only
involves focusing on style characteristics which are being
appraised on a priori grounds, but it also involves all the
confusions of ranking instead of grading. The importance
of correct synthesis is illustrated by a psychiatrist on the
staff at the University of Minnesota who became legendary
for requesting a grant so that a graduate student could "pull
his research results together," his "research results" being a
complete set of taped recordings of five years of therapy.
Evaluators that are tempted to "turn the facts over to the
decision-makers, and let them make the value-judgments"
should remember that evaluations are interpretations that
require all the professional skills in the repertoire; a scientist's
role does not end with observation and measurement.


Weighted-sum synthesis is linear synthesis and usually
works well. Rarely, as in the evaluation of backgammon
board positions or in evaluating patients on the MMPI, we
need non-linear synthesis rules. Synthesis is perhaps the
key cognitive skill in evaluation; it covers what is invoked
by the phrase "balanced judgment" as well as apples
and oranges difficulties. Its cousins appear in the core of all
intellectual activity; in science, not only in theorizing and
identifying the presence of a theoretical construct from the
data but in research synthesis. In evaluation, the wish to
avoid it manifests itself in laissez-faire evaluation and extreme
forms of the naturalistic approach. Balking at the final synthesis
is often (not always) balking at the value judgment
itself and close to valuephobia.

2. Synthesis may also refer to the process of reconciling
multiple independent evaluations. In this sense, it is a
much-abused and little-studied process of great importance.
For example, if drafts of the independent evaluations
have to be submitted to the client prior to the group synthesis
session, the final results are very different from those
where this requirement is not imposed (because of the need
to fight for an already "public" conclusion). See Parallel
Designs, Convergence.
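
The weighted-sum (linear) synthesis described under sense 1 can be sketched as follows. The ratings and the 1-to-5 scale are invented for illustration; the two weightings echo the claimed parity and the de facto 5:1 ratio of research to teaching mentioned in the entry.

```python
def weighted_sum(ratings, weights):
    """Linear synthesis: weighted average of the ratings on each dimension."""
    total_weight = sum(weights[dimension] for dimension in ratings)
    weighted_total = sum(ratings[dimension] * weights[dimension] for dimension in ratings)
    return weighted_total / total_weight

# Hypothetical faculty member rated on a 1-to-5 scale
ratings = {"research": 2, "teaching": 5}

claimed_weights = {"research": 1, "teaching": 1}   # the rhetoric: parity
de_facto_weights = {"research": 5, "teaching": 1}  # the practice: about 5:1

score_claimed = weighted_sum(ratings, claimed_weights)
score_de_facto = weighted_sum(ratings, de_facto_weights)

print("under claimed parity:", score_claimed)
print("under de facto 5:1:  ", score_de_facto)
```

Writing the weights down makes the gap between the stated and the actual procedure visible, which is the reason the entry gives for requiring an explicit statement of the synthesis procedure.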

SYSTEMS ANALYSIS The term is generally used interchangeably
with systems approach and system theory.
This approach places the product or program being evaluated
into the context of some total system. Systems analysis
includes an investigation of how the components of the
program/product being evaluated interact and how the environment
(system) in which the program/product exists
affects it. The "total system" is not clearly defined, varying
from a particular institution to the universe at large; hence
the approach tends to be more an orientation than an exact
formula and the results of its use range from the abysmally
trivial to deep insights. (Ref. C. W. Churchman, The Design
of Inquiring Systems.)

TA Technology Assessment. An evaluation, particularly
with respect to probable impact, of (usually new) technologies.
Discussed in more detail under Technology
Assessment, below.






TARGET POPULATION The intended recipients or 
consumers. Cf. Impacted Population.

TAXONOMIES Classifications, most notably Bloom's
taxonomies of educational objectives; a huge literature has
grown up around these taxonomies, which are rather simplistic
in their assumptions and excessively complex in their
ramifications. But a useful start for an evaluation design.

TEACHER EVALUATION Faculty are not always
teachers (sometimes they are researchers, and sometimes
just failures) and not all the teachers are on the faculty (e.g.
administrators, counselors and nurses). The "teachers" are
normally staff who are supposed to teach but may have
other duties. "Teacher evaluation" thus requires more than
merely evaluating their teaching (see Personnel Evaluation,
Style Research). As a first step, it requires identifying the
other duties and weighting them relative to teaching. The
evaluation of teaching itself requires evidence about: (a) the
quality of what is taught (its correctness, currency and comprehensiveness);
(b) the amount that is learned; (c) the
professionality and ethicality of the teaching process. The
ethics refers to e.g. justice in grading, and the avoidance of
racism, favoritism and cruelty; professionality refers to the
possession and use of appropriate skills in e.g. discipline,
the construction of test items, in spelling and in writing on a
chalkboard or report card. There are two surprises. First,
professionality does not include most of what goes into
"methods" courses, because little of it has been validated.
(The time would have been better spent trying to get the
teachers' competence up in the subject-matter or testing.)
Second, professionality not only includes the obligation to
take workshops on new material/theories/approaches to
teaching, but the obligation to steady self-evaluation, e.g.
using student questionnaires, gain scores. Whenever possible,
teachers should be evaluated on the amount of learning
they impart to comparable classes, using identical tests,
blind-graded and late-created (i.e. made up by random
sampling from a large item pool with external validation).
They should never be evaluated on the performance of their
students when entry and support level is not controlled.
See Synthesis. Ref. Handbook of Faculty Evaluation, ed.
Jay Millman (Sage, 1981).




TEACHING TO THE TEST The practice of teaching
just or mostly those skills or facts that will be tested, based
on illicit prior knowledge of, or inference as to, the test
content. If the test is fully comprehensive, e.g. testing
knowledge of the "time tables" by calling for all of them,
this is simply task-orientation and no crime. But most tests
only sample a domain of behavior and generalize from
performance on that sample as to overall performance in the
domain, and that generalization will be erroneous when
teaching to the test has occurred. A serious weakness of
teacher-constructed tests is that they create the same situation
ex post facto: see Testing to the Teaching.

TECHNOLOGY ASSESSMENT A burgeoning form of
evaluation which aims to assess the total impact of (typically)
a new technology. A cross between futurism and
systems analysis and consequently done at every level from
ludicrously superficial to brilliant. OTA usually scores
above the middle of the possible range. The process remains
in need of systematization; predicting that cassette recorders
would displace books was clearly fallacious at the
time, while predicting that hand-held optical-scanning
voice-input/output printing microcomputers will virtually
eliminate the necessity for instruction in basic skills by 1990
seems now (1980) to be so certain that the vast restructuring
of the educational system which it entails should have long
begun. One good feature of TA futurism might seem to be
that in the long run we'll know who was right, but so much
of it relates to potential that refutation is hard.

TERROR The effect frequently induced by goal-free
evaluation (sometimes by the thought of it) in the whole cast
of actors: evaluators, managers, evaluees. The "terror
test" is the use of this awful threat to determine whether the
cast is competent.

TESTS (& TEST ANXIETY) Tests are poor instruments
when the subjects are more anxious than they would
be in the criterion situation or when they test a domain
poorly matched to the test's alleged domain, but they are
better than most observers, including the classroom teacher,
in many, many cases.

TESTING TO THE TEACHING Designing tests to 




measure just what is actually taught, rather than testing learning
in the domain about which conclusions will be or need
to be drawn. Tests of a reading program that only use words
actually covered in class will give a false picture of reading
skills. As with "teaching to the test," this situation will not
be improper in the extreme case where the teaching covers
the whole domain.

TEST WISE Said of a subject who has acquired substantial
skills in test-taking, e.g. learning to say "False" on all
items which say "always" or "never," or (to give a sophisticated
example) learning not to guess on items one hasn't
time to think out, if a "correction for guessing" is being
used, but to do so if it is not.
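
The arithmetic behind the sophisticated example can be sketched under one assumption: that the "correction for guessing" is the conventional rights-minus-a-fraction-of-wrongs formula, R - W/(k - 1) for items with k options. The entry does not say which correction is meant, and the function names are hypothetical.

```python
from fractions import Fraction

def corrected_score(right, wrong, options_per_item):
    """Conventional correction (assumed): rights minus wrongs / (options - 1)."""
    return right - Fraction(wrong, options_per_item - 1)

def expected_gain_per_blind_guess(options_per_item, correction_used):
    """Average change in score from guessing blindly on one item."""
    p_right = Fraction(1, options_per_item)
    penalty = Fraction(1, options_per_item - 1) if correction_used else Fraction(0)
    return p_right * 1 - (1 - p_right) * penalty

gain_with_correction = expected_gain_per_blind_guess(5, correction_used=True)
gain_without_correction = expected_gain_per_blind_guess(5, correction_used=False)

print("expected gain, correction used:", gain_with_correction)
print("expected gain, no correction:  ", gain_without_correction)
print("30 right, 8 wrong, 5 options:  ", corrected_score(30, 8, 5))
```

On this assumption a blind guess gains nothing on average when the correction is applied, and a fifth of a point per five-option item when it is not.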

THEORIES General accounts of a field of phenomena,
generating at least explanations and sometimes also predictions,
often but not necessarily involving theoretical entities
that are not directly observable. A luxury for the evaluator,
since they are not even essential for explanations, and explanations
are not essential for (99 percent of all) evaluations.
It is a gross though frequent blunder to suppose that
"one needs a theory of learning in order to evaluate teaching."
One does not need to know anything at all about
electronics in order to evaluate electronic typewriters, even
formatively, and having such knowledge often adversely
affects a summative evaluation. See Conceptual Scheme.

THERAPEUTIC ROLE (OR MODEL) OF THE EVALUATOR
The very nature of the evaluation situation
creates pressures that sometimes mold it into a therapist-patient
or group therapy interaction, particularly but not
only with regard to external evaluation. First, there
is (in such a case) the client's feeling of having exhausted
his/her own resources, needing help badly, perhaps desperately.
Second, there is the aura of expertise and esoterica
which surrounds the external expert. Third, there are the
technical diagnoses and magical rites prescribed by the
good doctor. Since it's doubtful that there is usually much
more to psychotherapy than this, an amalgam which is
enough to generate at least the placebo effect, the analogy is
clear, and should be disturbing. The main problem with
placebo and Hawthorne effects is their transitory nature,
and the evaluator who fades back into the hills after an
ecstatic client's testimonial dinner may have to sneak back
for a look around a year later if s/he wants to get a good idea
of whether the recommendations were: (a) solutions to the
problems; (b) adopted; (c) supported. Hence follow-up
studies, sadly lacking in psychotherapy research (or innovative
evaluation) and often devastating when done, are
just as important in meta-evaluation.

TIME DISCOUNTING A term from fiscal evaluation
which refers to the systematic process of discounting future
benefits, e.g. income, for the fact that they are in the future
and hence (regardless of the risk, an essentially independent
source of value reduction for merely probable future
benefits) lose the earnings that those monies would yield if
in hand now, in the interval before they will in fact materialize.
Time discounting can be done with reference to any
past or future moment, but is usually done by calculating
everything in terms of true present value.
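
A minimal sketch of the calculation, assuming a constant annual rate with yearly compounding. The rate, the amount and the function name are illustrative choices, not figures from the text.

```python
def value_at(amount, from_year, to_year, annual_rate):
    """Restate an amount due in from_year as its equivalent value in to_year."""
    # Moving backward in time discounts the amount; moving forward compounds it
    return amount * (1 + annual_rate) ** (to_year - from_year)

# A benefit of 1000 due five years from now, restated as of today (year 0)
present_value = value_at(1000, from_year=5, to_year=0, annual_rate=0.08)
print("present value:", round(present_value, 2))

# The same function can refer everything to any other moment instead
restored = value_at(present_value, from_year=0, to_year=5, annual_rate=0.08)
print("value back at year 5:", round(restored, 2))
```

Risk is deliberately left out of the sketch: as the entry notes, it is a separate source of value reduction.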

TIME MANAGEMENT An aspect of management
consulting with which the general practitioner evaluator
should be familiar; it ranges from the trivial to the highly
valuable. Psychologists from William James to B.F. Skinner
are amongst those who have made valuable contributions to
it and it can yield very substantial output gains at very small
cost both for the evaluator and for clients or evaluees. It was
James who suggested listing tasks to be done in decreasing
order of enjoyability and beginning at the bottom, perhaps
since that gives you the largest reduction of guilt and the
biggest gain in charm for the remaining list. (Ref. James
McCay, The Management of Time, Prentice-Hall.)

TIME SERIES See Interrupted Time Series Analysis. 

TRAINING OF EVALUATORS Evaluators, like philosophers,
and unlike virtually every other kind of professional,
should be regarded as having an obligation to
know as much as possible about as much as possible. While
it is feasible and indeed quite common for evaluators to
specialize either in particular methodologies or in particular
subject matter areas, the costs of doing this are always
rather obvious in their work. It is probably a consequence of
the relative youth of evaluation as a discipline that the
search for illuminating analogies from other disciplines is





still so productive, but the other reason for versatility will
always be with us, namely that it enables one to do better as
an evaluator in as wide a range of subject matter areas as
possible. Columbia University used to have a requirement
that students could not be accepted for the doctorate in
philosophy unless they had a Master's degree in another
subject, and an analogous requirement might be quite desirable
in evaluation. However, it is commonly asserted that
the preliminary degree should be in statistics, tests, and
measurement. The problem with that requirement is that it
leads to a strong preferential bias in the eventual practice of
the professional. While skill in the quantitative methodologies
is highly desirable, it does not have to be a preliminary to
evaluation training; the reverse sequence may be preferable.
A simple formula for becoming a good evaluator is to
learn how to do everything that is required by the Key
Evaluation Checklist. Although the formula is simple, the
task is not; but it may be better to specify the core of evaluation
training in this way rather than by listing competencies
in terms of their supposed prerequisite status with respect
to evaluation. People get to be good evaluators by a large
number of routes, and the field would probably benefit by
increasing this number rather than standardizing the
routes. See Evaluation Skills.

TRAIT-TREATMENT INTERCONNECTION A less
widely-used term for aptitude-treatment interaction,
though it is actually a more accurate term.

TRANSACTIONAL EVALUATION (Rippey) Focuses
on the process of improvement, e.g. by encouraging anonymous
feedback for those that a change would affect and
then a group process to resolve differences. Though a
potentially useful implementation methodology in some
cases, transactional evaluation does not help much with
e.g. product evaluation or (in general) with the consumer
effects of a program, being mainly staff-oriented. See
Summative Evaluation.

TRANSCOGNITIVE Composed of Supercognitive
and Paracognitive.

TREATMENT A term generalized from medical research
to cover whatever it is that we're investigating, in
particular whatever is being applied or supplied to, or done
by, the experimental group that is intended to distinguish
them from the comparison group(s). Using a particular
brand of toothpaste or toothbrush or reading an advertisement
or textbook or going to school are all examples of
treatments. "Evaluand" covers these, but also products,
plans, and people etc.

TRIANGULATION Originally the procedure used by 
surveyors to locate ("fix") a point on a grid. In evaluation, or
scientific research in general, it refers to the attempt to get a 
fix on a phenomenon by approaching it from more than one 
independently based route. For example, if you want to 
ascertain the extent of sex stereotyping in a company, you 
will interview at several levels, you will examine training
manuals and interoffice memos, you will observe personnel 
interviews and files, you will analyze job/sex/qualification 
matches, job descriptions, advertising, placement and so 
on. In short, you avoid dependence on the validity of any 
one source by the process of triangulation. Note that this is 
quite different from looking at multiple traits/dimensions/ 
qualities in order to synthesize them into an overall evaluative
conclusion. Triangulation provides "redundant"
(really, confirmatory) measurement; it does not involve the
conflation of ontologically different qualities into estimates
of merit (worth, value etc.).

TRICKLE EFFECTS Indirect effects: spillover and rip-
ple effects are rough synonyms. 

TRUE CONSUMER Someone who, directly or indir- 
ectly, receives the services etc. provided by the evaluand. 
Does not include the service providers though they are also 
part of the impacted population. Is usually a very different 
group from the target population (intended primary
consumers).

TRUE EXPERIMENT A "true experiment" or "true experimental
design" is one in which the subjects are matched
in pairs or by groups as closely as possible and then one from
each pair or one group is randomly assigned to the control and
the other to the experimental group. The looser-and-larger
numbers version skips the matching step and just assigns
subjects randomly to each group. (Cf. ex post facto design
and quasi-experimental design.)
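
The two assignment procedures described in this entry can be sketched as follows. This is only an illustration: the function names, the seed parameter and the subject labels are assumptions made for the example, and the matching step itself (deciding which subjects form a pair) is taken as already done.

```python
import random


def assign_matched_pairs(pairs, seed=None):
    """Matched version: for each already-matched pair, a coin flip
    decides which member goes to control and which to experimental."""
    rng = random.Random(seed)
    control, experimental = [], []
    for first, second in pairs:
        if rng.random() < 0.5:
            first, second = second, first
        control.append(first)
        experimental.append(second)
    return control, experimental


def assign_unmatched(subjects, seed=None):
    """Looser, larger-numbers version: skip matching, shuffle the
    subjects and split the shuffled list in half."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


if __name__ == "__main__":
    # Hypothetical subject labels, matched beforehand.
    pairs = [("a1", "a2"), ("b1", "b2"), ("c1", "c2")]
    print(assign_matched_pairs(pairs, seed=1))
    print(assign_unmatched(["s1", "s2", "s3", "s4", "s5", "s6"], seed=1))
```

Passing a seed makes an assignment reproducible for audit; leaving it as None gives a fresh random assignment on each run.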

TWO-TIER SYSTEM (also called Multi-Tier System,
and Hierarchical System) A system of evaluation, sometimes
used in proposal evaluation (but also with considerable
potential in personnel evaluation) where an attempt is
made to reduce the total social cost of the ordinary RFP
system by requiring two rounds of competition. The first,
which is the only one RFP'd, involves stringent length
restrictions on the proposal, which is supposed to indicate
just the general approach and, e.g., personnel available.
These brief sketches are then reviewed by panels that can
move through them very fast, and a small number of promising
ones are identified. Grants are (sometimes) made to
the authors of this "short list" of bidders in order to cover
their costs in developing full proposals. The relatively small
number of full proposals is then reviewed by a smaller
group of reviewers or reviewing panels, the second tier of
the review system. The mathematics of this varies from case
to case, but it's worth looking at an example. Suppose we
simply put out the usual kind of RFP for improvement of
college science teaching laboratories. We might get back 600
or 1,200 proposals, averaging perhaps 50 or 60 pages in
length. For convenience let's say they average 50 pages and
we get 1,000 of them. That's 50,000 pages of proposals to be
read, and 50,000 pages of proposals to be written. Even if
reviewers can "read" 100 or 200 pages an hour, we're still
looking at 250-500 hours of proposal reading, which means
about 60 person-days of reading, i.e. a panel of 15 working
for four days, two panels of 15 working two days, or ten
panels of 6 working for one day. The problem is that you
can't get good reviewers for four days, and the small panels
require more personnel to staff, and then have to face the
serious problem of interpanel differences. Now if we go to a
two-tier system, then we can place an upper limit of, say,
five pages on the first proposal and, although we may get a
few more, that's a good result since it means that we'll get
some entries who don't have the time or resources required
to submit massive proposals. So we might start with 1,200
five-pagers, which is 6,000 pages, and we've immediately
got a reduction of 88 percent in the amount of reading that's
done, with the result that a single panel can reasonably
manage it. Then there will be perhaps ten or twenty best
proposals coming in at the 50 page length, which can be
handled quite quickly, and indeed much more carefully, by
the same panel, reconvened for that purpose. Notice also
that the reading speed for the first tier of proposals may be
higher since all the readers have to do is to be sure they're
not missing a promising proposal, rather than to rank-order
for final award. And validity should be higher. Notice the
triple savings that are involved: the proposers can save
about 90 percent of their costs (it may not be quite so high,
because shorter proposals take more than a prorated-by-page
amount of resources, but it's still substantial); the
agency saves a great deal of cost in paying raters or
panelists, and heavy staff work costs; and the reliability of
the process as well as the quality of available judges goes up
significantly. Hence the small subsidy for the second tier
preparation is more than justified, both fiscally and in terms of
encouraging entries from people that couldn't otherwise
afford it; and better entries from those that can.
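
The arithmetic of the example above can be checked with a short sketch. The proposal counts, page lengths and reading speeds come from the entry; the eight-hour reviewing day, the choice of twenty second-tier proposals (the upper end of "ten or twenty") and all names are assumptions made for the illustration.

```python
def total_pages(n_proposals, pages_each):
    """Pages that must be read (and written) in one round."""
    return n_proposals * pages_each


def reading_hours(pages, pages_per_hour):
    """Hours of reading at a given speed."""
    return pages / pages_per_hour


def person_days(hours, hours_per_day=8):
    """Convert hours to person-days; the 8-hour day is an assumption."""
    return hours / hours_per_day


# Ordinary one-round RFP: 1,000 proposals of 50 pages each.
one_round = total_pages(1000, 50)      # 50,000 pages

# Two-tier system: 1,200 five-page sketches, then (say) 20 full proposals.
first_tier = total_pages(1200, 5)      # 6,000 pages
second_tier = total_pages(20, 50)      # 1,000 pages

# Reading saved on the first tier alone: the 88 percent in the entry.
first_tier_reduction = 1 - first_tier / one_round

# Reading saved once the second-tier proposals are counted as well.
overall_reduction = 1 - (first_tier + second_tier) / one_round

if __name__ == "__main__":
    for speed in (100, 200):
        hours = reading_hours(one_round, speed)
        print(speed, "pages/hour:", hours, "hours,",
              person_days(hours), "person-days")
    print(round(first_tier_reduction, 2), round(overall_reduction, 2))
```

At 100 pages an hour the one-round figure comes to 500 hours, or a little over 60 eight-hour person-days, which is the basis of the "about 60 person-days" estimate.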

TYPE 1/TYPE 2 ERRORS See Hypothesis Testing. 

UNANTICIPATED OUTCOMES Often used as a
synonym for side-effects, but only loosely equivalent,
since outcomes may be unanticipated by inexperienced
planners but readily predictable by experienced ones; effects
that are anticipated but not goals are (sometimes) still
side-effects, and sometimes not (e.g. having to rent
offices).

UNCERTAINTY, Evaluating. See Risk. 

UNOBTRUSIVE MEASUREMENT The opposite of reactive
measurement. One that produces no reactive effect,
e.g. observing the relative amount of wear on the carpet in
front of interactive displays in a science museum as a measure
of relative amounts of use. Sometimes unethical, and
sometimes ethically preferable to obtrusive evaluation.
("Obtrusive" is not necessarily "intrusive"; it may be obvious
but not disruptive.)

UTILITY (Econ.) The value of something to someone
or to some institution. "Interpersonal comparison of utility"
is the stumbling-block of (welfare) economics. Sometimes
measured in the hypothetical units of "utiles." See
Apportionment, Cost.




UTILIZATION (of evaluations) This refers either to
the effort to improve implementation of an evaluation's
recommendations or to a metaevaluative focus on the extent
to which evaluations have been utilized. Utilization/implementation
must be planned into evaluations from the first
moment; indeed, if the client isn't in a position to utilize the
results appropriately, an ethical question arises as to
whether the evaluation should be done. Standard procedures
include putting representatives of the evaluees on the
evaluation team or advisory panel; soliciting and using suggestions
from the whole impacted population about design
and findings; identifying and focusing on positive benefits
of the evaluation if implemented; using appropriate language,
length and formats in the report(s); establishing a
balance of power to reduce threat; and, most importantly, a
heavy emphasis on explaining/teaching about the particular
and general advantages of evaluation. See also
Implementation of Evaluations. Carol Weiss sensibly suggests
"use" as a substitute for "utilization." Measures of use
are tricky; the problems begin with "conceptual use," i.e.
"The ideas caught on even if the recommendations were
never implemented." Then there's the problem whether an
implementation of bad suggestions should "score a point
for our side."

VALIDITY A test is valid if it really does measure what
it purports to measure. It can be reliable (in the technical
sense) without being valid, and it can be valid without being
credible. But if it's valid it has to be reliable: if the
thermometer is valid, it must say 100 degrees Centigrade
whenever placed in boiling water and hence must agree with
itself, i.e. be reliable. There are various subspecies of validity
in the jargon (especially face, content, construct, and
predictive validity), but they represent an inflation of
methodological differences into supposed conceptual
distinctions, except perhaps "face-valid" which possibly
should be distinguished since it only means "looks valid."
Serious investigation of validity will identify the appropriate
kind for e.g. the test being studied; one should not
talk about "valid in this sense, but not in that," only about
"valid in the appropriate sense." Valid evaluations are ones
which take into account all relevant factors, and weight
them appropriately in the synthesis process. They may or
may not be well-presented or credible (see Meta-evaluation).
Validity originally referred to the property of arguments
that are logically impeccable, whether or not rhetorically
impressive; the present sense is a natural extension,
though it upsets logicians. See External Validity,
Generalization.

VALUED PERFORMANCE A value-imbued descriptive
variable, imbued with value by the context. For example,
in the context of evaluating hot rods, the standing-
start quarter-mile time is the principal evaluative measure, 
the valued performance. On the one hand it's totally 
factual/descriptive; on the other hand, it is contextually 
imbued with value and is treated exactly as if it logically 
involved the concept of merit. Cf. Crypto-evaluative term.

VALUE-FREE CONCEPTION OF SCIENCE The be- 
lief that science, and in particular the social sciences, should 
not or cannot properly infer to evaluative conclusions, on 
the basis of purely scientific considerations. Mistakenly as- 
sumed to be a consequence of empiricism though in fact it 
requires the further (erroneous) premise that inference from 
facts to values is impossible; the error is precisely analogous 
to the error of supposing that one cannot infer to conclusions
about theoretical constructs from observations. (Popper's
simplistic attack on induction is thus partly responsible
for the continued support of the value-free doctrine.)
Apart from the logical errors, there is the evidence of one's 
senses that science is redolent with highly responsible and 
well-justified scientific evaluation of research designs, of 
estimates, of fit, of instruments, of explanations, of research 
quality, of theories. That the value-free position was main- 
tained at all in the face of these considerations, requires an 
explanation in terms of valuephobia. See Needs Assess- 
ment and LE. 

VALUE-IMBUED TERM See Valued Performance. 

VALUE JUDGMENT A claim as to the merit, worth or
value of something, originally (and still typically) involving
judgment but then extended to cover all claims about value,
many of which are observational/mensurational (as in mild
assertions about outstanding athletic performances: "That
dive is worth at least 6 points, as anyone could see"). Since
evaluation typically involves multi-attribute integration, it's
not surprising that judgment is often involved, through
the weighting of the various attributes. But the idea that
value judgments must always be arbitrary/subjective/
unscientific was only built into the concept as the doctrine of
value-free science took hold. And it represents a further
error, since multiple-attribute measures can easily be
objectively and pragmatically defined. See Relativism.

VALUEPHOBIA The resistance to evaluation that generated
the myth of value-free science, the attacks on properly-used
testing or course grading (see Kill the Messenger),
on program evaluations for accountability and on the
evaluation of college faculty is often more than any rational
explanation can cover. We use the term "valuephobia" to cover it
without any implications of neurosis, just irrationality. Of
course the natural defensive strategy (attack anything that
is a threat) is part of it, but part of it goes deeper, into the
unwillingness to face possibly unpleasant facts about oneself
even if it means large long-run benefits. (This phenomenon,
related to "denial," is seen in people who won't go
to a doctor because they don't want to hear about
imperfections.) Valuephobia leads to many abuses, e.g. pathetic
guarantees that an evaluation will be done "only to help,
not to criticize" (if there are no valid criticisms, there's rarely
any justification for help of programs/performances
involving professionals); to the substitution of implementation
monitoring for outcome-based program evaluations; to
the refusal of professional associations to use professional
standards in their own accreditation or enforcement
procedures; to excessive involvement of evaluation staff with
the program staff ("to reduce anxiety" or "to improve
implementation") which frequently produces pablum
evaluations; and (via guilt) to the absurd ratio of favorable to
unfavorable program evaluations, absurd given what we really
know about the proportion of bad programs. The clinical
status of valuephobia as a U.S. cultural phenomenon is
more obvious to a visitor from e.g. England where very
tough criticism in the academy is not taken personally to the
degree it is here, and it is in this country that Consumers
Union was listed by the Attorney-General as a subversive
organization and (independently) banned from advertising
in newspapers. But the ubiquity of valuephobia is more
important; Socrates was killed for his teaching and application
of evaluative skills and dictators today seem no less
inclined to murder their critics than the Greek "democracy."
Humility may best be construed not as the avoidance
of self-regard but as the valuing of criticism, and this state
should also be valued as the outcome of successful "treatment"
(hopefully educational rather than therapeutic) for
valuephobia. It should be combined with some capacity to
distinguish good from bad criticism. See Educational Role,
Empiricism.

VALUES (in evaluation; & measurement of) The values
that make evaluations more than mere descriptions can
come from a variety of different sources. They may be
picked up from a relevant and well-tried set of e.g. professional
standards. They may come from a needs assessment
which might show that children become very ill without a
particular dietary component (i.e. need it). Or they may
come from a logical and pragmatic analysis of the function
of something (processing speed in a computer is a virtue,
ceteris paribus). They may even come from a study of wants
and of the absence of ethical impediments to their fulfilment
(e.g. in building a better roller-coaster). In all of these
cases, the foundations are factual and the reasoning is
logical; nothing comes in that a scientist should be
ashamed of. But something hovers in the background that
scientists are embarrassingly incompetent to handle,
namely ethics. Without doing ethics, however, most evaluations
can be validated by just checking for salient ethical
considerations that might override the non-ethical reasoning.
The values/preferences that sometimes come into the
evaluation as the ultimate data range in visibility from obvious
(political ballots) to very inaccessible (attitudes towards
job-security, women supervisors, censorship of pornography).
Most instruments for identifying the more subtle ones are of
extremely dubious validity; they are best inferred from behavior;
although that inference is also difficult, it begins
with the kind of event we are (usually) hoping to influence.
Some simulations are so good that they probably elicit true
values, especially if not very important ones are involved;
usually behavior in real situations should be used. See
Ethics.




VARIABILITY The extent to which a population is 
spread out over its range, as opposed to concentrated near 
one or a few places (or modes); the feature that produces
dispersion.

WEIGHT AND SUM The traditional process of synthesis
in evaluation: points are awarded for performance on
each valued dimension, e.g. on a 1-5 scale, and the dimensions
are weighted for their relative importance (e.g. on a
1-3, 1-5, or 1-10 scale); then the products of the weights
and the performance scores are totalled for each candidate.
Although this is a very useful process, sometimes valid and
nearly always clarifying, there are many traps in it. Some
examples: if a careful balance between maximum weight,
maximum performance score and number of dimensions is
not maintained, a bunch of trivia can swamp crucial factors;
"no-compromise" requirements must be used as a preliminary
filter, not as heavily weighted factors; linearity of utility
(points) across performance variables must not be assumed;
interactions need separate analysis; surplus performance
on dimensions with no-compromise thresholds
must be separately weighted (possibly not very heavily);
eventually, only pairwise comparisons should be made,
though a multi-candidate table can be used as a crude first
filter. See Evaluation News (Vol. 2, no. 1, February 1981,
pp. 85-90).
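
A minimal sketch of the weight-and-sum procedure, with "no-compromise" requirements applied as a preliminary filter rather than as heavy weights, might look like this. The dimension names, scales, scores and function name are hypothetical, chosen only to illustrate the arithmetic; the sketch deliberately ignores the other traps listed above (non-linear utility, interactions, surplus performance).

```python
def weight_and_sum(candidates, weights, minimums=None):
    """Return {candidate: weighted total} for candidates that pass
    every no-compromise minimum.

    candidates: {name: {dimension: performance score}}
    weights:    {dimension: importance weight}
    minimums:   {dimension: lowest acceptable score}, used only as a
                preliminary filter, never folded into the weights
    """
    minimums = minimums or {}
    totals = {}
    for name, scores in candidates.items():
        # Preliminary filter: drop anyone below a no-compromise threshold.
        if any(scores[dim] < floor for dim, floor in minimums.items()):
            continue
        # Sum of (weight x performance score) over all dimensions.
        totals[name] = sum(weights[dim] * scores[dim] for dim in weights)
    return totals


if __name__ == "__main__":
    # Hypothetical dimensions scored 1-5, weighted 1-5.
    weights = {"accuracy": 5, "cost": 3, "ease_of_use": 1}
    candidates = {
        "A": {"accuracy": 4, "cost": 3, "ease_of_use": 5},
        "B": {"accuracy": 5, "cost": 2, "ease_of_use": 2},
        "C": {"accuracy": 1, "cost": 5, "ease_of_use": 5},
    }
    print(weight_and_sum(candidates, weights, minimums={"accuracy": 2}))
```

Candidate C is removed by the filter before any totalling, so strong scores on the minor dimensions cannot compensate for failing the threshold. The resulting table is best treated as the crude first filter the entry describes, to be followed by pairwise comparison.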

WHITEWASH (Suchman) See Rationalization Eval- 
uation. 

WHOLISTIC Alternative spelling of Holistic. 

WHY DENY A conference with the staff of a funding
agency which unsuccessful bidders on an RFP may request
and at which they are informed about the reasons why they
lost out. One of the consequences of the recent move towards
openness. Unfortunately the failure to use salience
scoring and other systematic procedures means that reviewer
and staff feedback is very difficult to interpret in a
useful way.

WIRED A contract or an RFP is said to be "wired" if
either through its design and requirements or through an
informal agreement between agency staff and a particular
contractor, it is arranged so that it will go to that contractor.
Certainly illegal, and nearly always immoral. The mere fact
that the RFP (with intrinsic good reasons) predetermines
the contractor, e.g. because the problem can in fact
only be handled by an outfit with two Cray computers, does
not constitute wiring.

WORTH System value by contrast with intrinsic value
(merit); e.g. market value is a function of the market, not of
the evaluand's own virtues. The worth of a professor is a
function of the enrollment in her or his classes, grant-getting,
relation to the college's mission, role-modeling
function for prospective/actual women or minority students,
as well as his/her professional merit. The latter is a
necessary but not sufficient condition for the former. Cf.
Success.

ZERO-BASED BUDGETING (ZBB) A system of bud- 
geting in which all expenditures have to be justified rather 
than additional expenditures (i.e. variations from "level- 
funding"). Temporarily fashionable in Washington in the 
70s, its merits for summative evaluation are overwhelming; 
the practical difficulties are easily handled, but the political 
squawks from entrenched programs may be harder to manage.
The original reference is Peter Pyhrr's book of this title.
See Apportionment. 






ACRONYMS & ABBREVIATIONS



AA Audit Agency, a division of HHS/ED that reports
directly to the Secretary and does internal audits (cf. GAO)
that amount to evaluations of programs and contracts
including evaluations. Has moved from CPA orientation to
much broader approach and does much very competent
work (though spread a little thin); still doesn't look at e.g.
validity of test-instruments used.

AAHE American Association of Higher Education 

ABT Properly, Abt Associates. Large shop with strong 
evaluation capability; headquarters Cambridge, Massa- 
chusetts 

ACT American College Testing 

AERA American Educational Research Association 

AID Agency for International Development 

AIR American Institutes for Research, a Northern Cali- 
fornia-based contractor with some evaluation capability 

ANCOVA Analysis of covariance 

ANOVA Analysis of variance

ATI Aptitude-treatment interaction 

AV Audiovisual 

AVLINE Online audio-visual database maintained by
the National Library of Medicine


CAI Computer-assisted instruction 

CBO Congressional Budget Office. Provides analysis
and evaluation services to Congress, as GAO does for the
administration

CBTE, CBTT, CBTP Competence Based Teacher Education,
Training or Preparation

CDC Control Data Corporation, one of the top five com- 
puter companies. 

CEEB College Entrance Examination Board 

CEDR Center for Evaluation, Development and Re- 
search (at Phi Delta Kappa) 

CFE Cost-free evaluation

CIPP Daniel Stufflebeam and Egon Guba's model
which distinguishes four types of evaluation: context, input,
process, and product, all designed to delineate, obtain,
and provide useful information for the decision-maker

CIRCE Center for Instructional Research and Curriculum
Evaluation, University of Illinois, Urbana, Illinois

CMHC Community Mental Health Center or Clinic 

CMI Computer Managed Instruction 

CN Consultants News, the highly independent newsletter
of the management consulting area, run by talented
loner Jim Kennedy

COB Close of business (end of working day, a proposal
deadline)

COPA Council on Post-Secondary Accreditation 

CRT Criterion-referenced test 

CSE Center for the Study of Evaluation (at UCLA)

CSMF Comprehensive School Mathematics Study
Group

DBMS Database Management System. Computer
software






DEd (properly ED) Department of Education (ex- 
USOE) 

DOD Department of Defense 

DOE Department of Energy 

DRG Division of Research Grants 

DRT Domain-referenced test

ED Education Department 

EIR Environmental Impact Report 

EN Evaluation News, the newsletter journal of the 
Evaluation Network 

en evaluation notes, a newsletter on evaluation methodology
published by Edgepress

ENet Evaluation Network, a professional association of
evaluators

EPIE Education Products Information Exchange

ERIC Educational Resources Information Center; a
nationwide information network with its base in Washington,
D.C. and 16 clearinghouses at various locations in the
U.S. Available as online database

ERS Evaluation Research Society, a professional association
of evaluators

ESEA Elementary and Secondary Education Act of 1965

ETS Educational Testing Service, headquarters in
Princeton, N.J.; branches in Berkeley, Atlanta, etc.

FRACHE Federation of Regional Accrediting Commissions
of Higher Education

FY Fiscal year

G & A General and administration (expenses, costs)

GAO General Accounting Office. The principal semi-external
evaluation agency of the Federal government

GBE Goal-based evaluation

GFE Goal-free evaluation



GIGO Garbage In, Garbage Out (from computer prog- 
ramming; see meta-analysis) 

GP General-purpose (evaluator) 

GPA Grade-point average 

GPO Government Printing Office, Washington, DC

GRE Graduate Record Examination 

HEW Department of Health, Education and Welfare,
now divided into ED and HHS

HHS Department of Health and Human Services 

IBM International Business Machines 

IOX Instructional Objectives Exchange

K $1000, as in "16K for evaluation."

K-12 Kindergarten through high school years

K-6 The domain of "Elementary Education"

LE The Logic of Evaluation, a monograph by the present
author

LEA Local Education Authority (e.g. school district)

LSAT Law School Admission Test

M Thousand, as in "$16M for evaluation."

MAS Management Advisory Services, term usually ref- 
ers to subsidiaries of the Big 8 accounting firms 

MBO Management by Objectives

MIS Management Information System, usually a computerized
database combining fiscal, inventory, and performance
data

MMPI Minnesota Multiphasic Personality Inventory

MOM Modus Operandi Method

NCATE National Council for Accreditation of Teacher 
Education 

NCES National Center for Educational Statistics 
NCHCT National Center for Health Care Technology




NIA National Institute on Aging

NICHD National Institute of Child Health and Human
Development

NIE National Institute of Education (in ED) 

NIH National Institutes of Health (includes NIMH, 
NIA etc.), or Not Invented Here (so don't encourage its use) 

NIJ National Institute of Justice 

NIMH National Institute of Mental Health

NLM National Library of Medicine

NSF National Science Foundation

NWL Northwest Lab, Portland, Oregon. One of the
federal network of labs and R & D centers; has strongest
evaluation tradition (Worthen, Saunders, Smith)

OE Office of Education 

OHDS Office of Human Development Services

OJT On-job training

OMB Office of Management and Budget

OPB Office of Planning and Budgeting

ONR Office of Naval Research, sponsor of e.g. Encyclopedia
of Educational Evaluation

OTA Office of Technology Assessment

P & E Planning and Evaluation, a division of HEW/
HHS, including regional offices, where it reports directly to
Regional Directors. In ED, currently called OPB

PBTE Performance Based Teacher Education

PDK Phi Delta Kappa, the influential and quality- 
oriented educational honorary that publishes The Kappan 

PEC Product Evaluation Checklist 

PERT Program Evaluation and Review Technique 

PHS Public Health Service 

PLATO The largest CAI project ever, original headquarters
at the University of Illinois/Champaign. Mostly
NSF funded in development phase, now CDC-controlled.

PPBS Planning Programming Budgeting System

PSI Personalized System of Instruction (a.k.a. The Keller
Plan)

PT Programmed Text

RAND Big Santa Monica-based contract research and
evaluation and policy analysis outfit. Originally a U.S.
Air Force "creature" (civilian subsidiary), set up because
they couldn't get enough specialized talent from within the
ranks; name came from Research ANd Development.
Now independent non-profit, though still does work
for USAF.

RFP Request for proposal

SAT Scholastic Aptitude Test. Widely used for college
admissions

SDC Systems Development Corporation in Santa
Monica, another large shop like Rand with substantial
evaluation capability.

SEA State Education Authority 

SEP School Evaluation Profile 

SES Socioeconomic status 

SMERC San Mateo Educational Resources Center, an
information center which houses numerous collections of
educational materials to meet the information needs of educators
in several states surrounding California. Most collections
are in-house, but SMERC also has access to ERIC
files. SMERC is located in Redwood City, CA.

SMSG School Mathematics Study Group. One of the
earliest and most prolific of the federal curriculum reform
efforts

SRI Originally Stanford Research Institute, in Menlo
Park, CA, once part-owned by Stanford University, now
autonomous. Large "shop" which does some evaluation

TA Technology Assessment or Technical Assistance






TAT Thematic Apperception Test

TCITY Twin Cities Institute for Talented Youth. Site of
the first advocate-adversary evaluation (Stake & Denny)

USAF United States Air Force. Heavy (and pretty competent)
R&D commitment, like Navy, and unlike Army.

USDA United States Department of Agriculture

USOE United States Office of Education, now ED or
DEd (Department of Education)

WICHE Western Interstate Clearinghouse on Higher
Education



